Adaptive Clocking: Minding Your P-States And C-States

Avoid downtime during large-scale power state changes by allowing clock generators to switch between frequencies in response to voltage changes.


Larger processor arrays are here to stay for AI and cloud applications. For example, Ampere offers a 128-core behemoth for hyperscalers (mainly Oracle), while Esperanto integrates almost 10x more cores for AI workloads. However, power management becomes increasingly important with these arrays, and system designers need to balance dynamic power with system latency. Year over year, the transistor savings from each process node jump are being overtaken by growing design complexity, as figure 1 shows.

Fig. 1: Transistor savings per process node jump vs. design complexity. (Data Source: TechInsights)

Most data center solutions have a comprehensive power state table with varying techniques to reduce active power consumption. These techniques range from pausing execution on a core to full sleep mode. In between these extremes, processors will shut down their clock and flush successive cache levels.
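To make that concrete, a power state table can be pictured as an ordered list of progressively deeper idle states, each trading lower power for a longer exit latency. The Python sketch below is a minimal illustration; the state names follow common ACPI-style C-state labels, but the power and latency numbers are placeholder assumptions, not vendor specifications.

```python
from dataclasses import dataclass

@dataclass
class CState:
    name: str               # ACPI-style label
    action: str             # what the core gives up in this state
    power_mw: float         # illustrative residual power draw
    exit_latency_us: float  # illustrative time to return to active execution

# Hypothetical power state table, ordered from shallow to deep.
# All numbers are placeholders for illustration only.
C_STATE_TABLE = [
    CState("C1", "pause execution, clocks still running",  800.0,   1.0),
    CState("C2", "stop core clocks",                       400.0,  10.0),
    CState("C3", "flush caches, clocks stopped",           150.0,  50.0),
    CState("C6", "power-gate the core (full sleep)",        10.0, 200.0),
]
```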

As a processor progresses into deeper sleep states, its exit latency (the time to return to active execution) increases, which also reduces overall throughput. To remain power-efficient, the compute unit must stay in its conservation state for longer than its exit latency. The actual downtime varies and grows as the exit latency increases, as figure 2 shows.

Fig. 2: Downtime grows with exit latency. (Data Source: Mete Balci)
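That break-even condition comes down to simple arithmetic: the energy saved while resident in the idle state must outweigh the energy burned transitioning back out of it. Here is a minimal sketch of that check; the power and latency figures are illustrative assumptions only.

```python
def is_power_effective(active_power_w: float,
                       idle_power_w: float,
                       exit_latency_us: float,
                       residency_us: float) -> bool:
    """Return True if sleeping for `residency_us` saves net energy.

    Simplified model: during exit the core burns roughly active power
    while doing no useful work, so the residency must out-save that cost.
    """
    energy_saved = (active_power_w - idle_power_w) * residency_us
    exit_cost = active_power_w * exit_latency_us
    return energy_saved > exit_cost

# Illustrative numbers only: a deep state with a 200 us exit latency
# needs well over 200 us of residency before it pays for itself.
print(is_power_effective(5.0, 0.05, 200.0, 150.0))   # False: idle too short
print(is_power_effective(5.0, 0.05, 200.0, 1000.0))  # True: long idle window
```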

Data center and AI processors are only getting more complex to meet end-users' needs. Modern data centers handle standard virtualization tasks and are seeing an increasing number of AI-related workloads. Diving a little deeper, AI workloads themselves are becoming more complex because model makeup is changing at such a rapid pace. Without a predictable stream of operations, cores would have to stay in active execution mode or a very low-latency conservation state (such as C1 or C2), limiting the power savings. Conversely, a processor could raise its frequency at the cost of its dynamic power consumption. Mainstream server processors from Intel and AMD use boost modes to transiently raise throughput for larger workloads.
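One practical consequence is that an idle governor has to guess how long a core will stay idle before choosing a state. A minimal sketch of that selection logic might look like the following; the states, latencies, and safety margin are hypothetical and not drawn from any vendor's actual governor.

```python
# Hypothetical (name, exit_latency_us) pairs, ordered shallow to deep.
IDLE_STATES = [("C1", 1.0), ("C2", 10.0), ("C3", 50.0), ("C6", 200.0)]

def pick_idle_state(predicted_idle_us: float, margin: float = 2.0) -> str:
    """Pick the deepest state whose exit latency fits the predicted idle window.

    With unpredictable AI workloads the prediction is unreliable, so the
    governor applies a safety margin and often lands in C1/C2, limiting savings.
    """
    choice = "C1"
    for name, exit_latency_us in IDLE_STATES:
        if predicted_idle_us >= margin * exit_latency_us:
            choice = name
    return choice

print(pick_idle_state(30.0))    # short, uncertain idle -> "C2"
print(pick_idle_state(1000.0))  # long, predictable idle -> "C6"
```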

While boosting frequency and entering a conservation state may seem like opposites, they have a similar effect on the clock network, which is the first domino in determining exit latency. As a compute core exits a power-saving state or enters boost mode, it increases the load on the processor's power delivery. When many compute cores make these transitions at once, the strain increases proportionally.

The clock network is sensitive to voltage droop (power supply fluctuations). During these fluctuations, a traditional PLL must halt operation and lock at a lower frequency, then halt again and relock at the higher frequency once the supply recovers. As you can imagine, this lock-and-relock sequence contributes significantly to latency during power state changes.
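To put that in perspective, even a brief halt-and-relock interval translates into many lost clock cycles, and the cost repeats with every transition. The numbers below are illustrative assumptions rather than measurements of any particular PLL.

```python
# Illustrative assumptions only: a 3 GHz core clock and a 50 ns
# halt-and-relock interval for a conventional PLL during a voltage droop.
core_freq_hz = 3.0e9
relock_time_s = 50e-9

cycles_lost_per_relock = core_freq_hz * relock_time_s
print(cycles_lost_per_relock)  # 150 cycles of dead time per transition

# Repeated across thousands of power state changes per second,
# the dead time accumulates into a measurable throughput loss.
transitions_per_sec = 10_000
print(cycles_lost_per_relock * transitions_per_sec)  # 1.5 million cycles/s
```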

To remove the lock-and-relock procedure and minimize system latency, large array processors must adopt adaptive clocking. The technique allows a clock generator to switch between frequencies instantly in response to voltage changes. By stepping through frequencies rather than halting the clock, system designers can avoid tens (if not hundreds) of cycles of downtime during large-scale power state changes, which significantly boosts power efficiency and improves inferences per second or batch training time.
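In behavioral terms, an adaptive clock generator watches the supply rail and steps the output frequency down as a droop develops, then steps back up as the rail recovers, all without stopping the clock to relock. The sketch below is a highly simplified behavioral model, not any vendor's implementation; the voltage thresholds and frequency steps are illustrative assumptions.

```python
# Simplified behavioral model of adaptive clocking: the clock keeps running
# and merely tracks the supply rail, so no lock/relock dead time is incurred.
# Thresholds and frequency steps are illustrative assumptions only.

FREQ_STEPS_GHZ = [3.0, 2.7, 2.4, 2.1]    # nominal plus three droop steps
DROOP_THRESHOLDS_V = [0.75, 0.72, 0.69]  # step down when VDD falls below these

def adaptive_clock_step(vdd: float) -> float:
    """Return the clock frequency to use at the current supply voltage."""
    step = 0
    for threshold in DROOP_THRESHOLDS_V:
        if vdd < threshold:
            step += 1
    return FREQ_STEPS_GHZ[step]

# A droop event and recovery: frequency follows the rail
# rather than halting while a PLL relocks.
for vdd in [0.78, 0.74, 0.70, 0.68, 0.71, 0.76]:
    print(f"VDD={vdd:.2f} V -> {adaptive_clock_step(vdd):.1f} GHz")
```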

As chips get larger and larger with ballooning TDPs, adaptive clocking will become a must-have. Without it, data centers will face a higher-than-expected total cost of ownership due to coarse power state control. A shift toward this technique has the potential to raise chip performance while properly managing power.


