For AI Hardware, Power Optimization Starts With Software And Ends At Silicon

The new era of artificial intelligence hardware calls for efficient software-to-silicon power analysis and optimization.


Artificial intelligence (AI) processing hardware has emerged as a critical piece of today’s tech innovation. AI hardware architectures are highly symmetric, with large arrays of up to thousands of processing elements (tiles), leading to billion-plus-gate designs and enormous power consumption. For example, the Tesla Autopilot software stack consumes 72W of power, while the neural network accelerator consumes 12W (Source: The Verge). A recent study from Stanford has shown that building and training a complex neural network can produce up to 78,000 pounds of carbon emissions (the equivalent of flying 60 passengers from San Francisco to New York). Designing AI hardware for efficient energy consumption has become critical, not only to reduce the cost of running server farms and to improve battery life, but also to help preserve our planet.

The challenge of optimizing AI power necessitates a comprehensive approach, which includes: 1) analyzing software and hardware together with the goal of optimizing both; 2) defining the best possible architecture and power management; 3) obtaining early total and glitch power numbers at the RTL stage to identify the best micro-architectures; 4) making power a cost function during implementation; and 5) performing efficient power and signal integrity signoff.

1. System-level power analysis, or how to define the best architecture for AI hardware

System-level analysis is key to identifying the architecture that delivers maximum performance at the lowest power. Because tile-to-tile traffic is intense when an algorithm runs on AI hardware, and because an enormous amount of switching activity happens synchronously, it is critical to analyze the execution of the software application on a model of the hardware to define a software and hardware architecture that spreads the switching activity. Techniques include clock spreading, distributing memory accesses over time, developing better dynamic voltage and frequency scaling (DVFS), improving power shutdown schemes, and optimizing power management strategies.


Figure: Power vs. performance vs. energy trade-off analysis (Source: Synopsys)
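To make the clock spreading idea concrete, here is a minimal sketch (in Python, with purely illustrative numbers) of how staggering tile clock edges across the cycle lowers the aggregate peak current of a symmetric tile array; the pulse shape, tile count, and time resolution are assumptions, not silicon data.

```python
# Minimal sketch (hypothetical numbers): estimate how staggering tile clock
# edges ("clock spreading") reduces the instantaneous peak current of a
# symmetric AI tile array. All waveform values are illustrative.

NUM_TILES = 64          # tiles switching each cycle (assumption)
STEPS_PER_CYCLE = 100   # time resolution within one clock cycle
# Per-tile current profile after a clock edge: a short spike, then decay.
TILE_PULSE = [10.0, 8.0, 5.0, 2.0, 1.0]  # mA per time step (illustrative)

def aggregate_current(phase_offsets):
    """Sum per-tile current pulses into one waveform for a full cycle."""
    waveform = [0.0] * STEPS_PER_CYCLE
    for offset in phase_offsets:
        for i, amp in enumerate(TILE_PULSE):
            # Wrap-around models the steady-state periodic waveform.
            waveform[(offset + i) % STEPS_PER_CYCLE] += amp
    return waveform

# Case 1: all tiles clocked on the same edge -> switching aligns.
aligned = aggregate_current([0] * NUM_TILES)
# Case 2: clock spreading -- tile edges staggered evenly across the cycle.
spread = aggregate_current([(t * STEPS_PER_CYCLE) // NUM_TILES
                            for t in range(NUM_TILES)])

print(f"peak current, aligned clocks : {max(aligned):7.1f} mA")
print(f"peak current, spread clocks  : {max(spread):7.1f} mA")
```

Even this toy model shows the aligned case peaking at the sum of all tile spikes, while spreading the edges flattens the waveform to a small fraction of that peak, which is precisely what helps di/dt and IR drop.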

2. Power profiling of software and hardware using emulation

Another way to analyze the power of a tile in the context of the full chip and software is to use emulation. Emulation enables power analysis while the real workload (up to billions of cycles) runs on the design, and it identifies windows of interest for di/dt, peak power, or average power analysis. Given the large number of MAC operations per cycle, identifying these windows is critical for IR drop and peak power analysis. Emulation quickly produces a power profile of the workload and provides feedback to the software and hardware engineers; for example, it can expose power wasted during tile-to-tile operations that can be eliminated through software changes or hierarchical clock gating.
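As a rough illustration of the window search described above, the following sketch scans a synthetic per-cycle power trace for the highest-average power window and the worst cycle-to-cycle power step (a proxy for di/dt); in a real flow the trace would come from emulation-based power profiling, and all values here are made up.

```python
# Minimal sketch, not a tool flow: scan a per-cycle power profile (as an
# emulator might produce for a long workload) for windows of interest --
# the peak average-power window and the worst di/dt event.
import random

random.seed(0)
CYCLES = 1_000_000
# Synthetic profile: baseline activity plus one burst (illustrative).
trace = [1.0 + random.random() +
         (5.0 if 400_000 <= c < 400_200 else 0.0) for c in range(CYCLES)]

def worst_window(trace, width):
    """Return (start_cycle, mean_power) of the highest-average window."""
    window_sum = sum(trace[:width])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(trace) - width + 1):
        # Slide the window: add the entering cycle, drop the leaving one.
        window_sum += trace[start + width - 1] - trace[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_sum / width

def worst_didt(trace):
    """Return the cycle with the largest cycle-to-cycle power step."""
    return max(range(1, len(trace)), key=lambda c: abs(trace[c] - trace[c - 1]))

start, avg = worst_window(trace, width=100)
print(f"peak 100-cycle window starts at cycle {start}, avg {avg:.2f} (a.u.)")
c = worst_didt(trace)
print(f"worst di/dt at cycle {c}: {trace[c] - trace[c - 1]:+.2f} (a.u.)")
```

The windows found this way are then handed to detailed IR drop and peak power analysis, so that the expensive signoff tools only need to examine a few hundred cycles instead of the full workload.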

3. Early power analysis and optimization at RTL

Due to the symmetric, replicated architecture of AI hardware, it is very important to identify the best possible micro-architecture, clock gating, memory gating, or data gating for the tile at the RTL stage. Reducing the power of a highly replicated tile translates into large energy savings at the chip level. This is enabled by physically aware RTL power analysis, which provides early but accurate power estimates (typically within 10% of signoff). RTL power analysis in turn enables fast what-if exploration to identify the best micro-architecture and provides guidance on how to improve clock gating efficiency and memory access rates. Additional data gating at this stage can yield up to 25% power savings for an AI processing tile.
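The following sketch mimics the kind of what-if estimate an RTL power flow supports, using a toy activity model: given per-register toggle data and the fraction of cycles each result is actually consumed, it estimates the dynamic power saved by data gating the MAC operands. The register names, toggle rates, and per-toggle energy are hypothetical, not measured data.

```python
# Minimal what-if sketch: estimate dynamic power saved by data gating
# (operand isolation) a MAC datapath, from per-register activity data.
# All names and numbers below are illustrative assumptions.

PER_TOGGLE_ENERGY_PJ = 0.05  # assumed energy per bit toggle (illustrative)

# (register bank, bits, toggle rate, fraction of cycles its result is used)
banks = [
    ("mac_operand_a", 512, 0.45, 0.60),
    ("mac_operand_b", 512, 0.40, 0.60),
    ("accumulator",   640, 0.30, 0.95),
    ("ctrl_fsm",       64, 0.10, 1.00),
]

def dynamic_power(bits, toggle_rate):
    """Dynamic power in pJ/cycle under the simple toggle-energy model."""
    return bits * toggle_rate * PER_TOGGLE_ENERGY_PJ

total = gated = 0.0
for name, bits, rate, used in banks:
    p = dynamic_power(bits, rate)
    # Data gating holds inputs stable when the result is unused, so only
    # the "used" fraction of toggles remains (idealized model).
    p_gated = p * used
    total, gated = total + p, gated + p_gated
    print(f"{name:14s} {p:7.2f} -> {p_gated:7.2f} pJ/cycle")

print(f"estimated datapath saving: {100 * (1 - gated / total):.1f}%")
```

Because the tile is replicated thousands of times, a saving in this range on one tile compounds into a substantial reduction in total chip power.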

4. Glitch power – A significant concern for AI-style designs

Due to the huge number of operations performed when an AI algorithm runs on hardware, glitch power has become a critical component of power consumption; it can represent up to 40% of total power. Typically, glitch power is computed very late in the flow, once gate-level simulation with timing delays is available. That is too late to change the micro-architecture, to factor glitch power into power costing during implementation, or to run specific ECOs to reduce it.


Figure: Percentage of glitch power vs. total power for different designs (Source: Synopsys)

Newer approaches can anticipate glitch power accurately from RTL or zero-delay simulation. Estimating glitch power within 5% of signoff this early in the flow drives better design decisions during RTL development and better power costing during implementation and ECO, drastically reducing glitch power.


Figure: Early glitch estimator combinational power results within 5 percent of GLS (Source: Synopsys)
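To illustrate why glitch power can be anticipated without full timing simulation, here is a heavily simplified sketch: it approximates extra (glitch) transitions at each combinational node from the arrival-time skew of its inputs, since misaligned input transitions cause an output to toggle more than once per cycle. The tiny netlist, unit delays, and skew heuristic are illustrative assumptions, not any tool’s actual algorithm.

```python
# Heavily simplified sketch of early glitch estimation: approximate extra
# (glitch) toggles at each node from its input arrival-time skew, using
# functional toggle rates from a zero-delay simulation. Illustrative only.

# node -> (list of fanin nodes, functional toggle rate from 0-delay sim)
netlist = {
    "a":    ([],             0.30),   # primary inputs: no fanin
    "b":    ([],             0.25),
    "c":    ([],             0.20),
    "and1": (["a", "b"],     0.15),
    "xor1": (["and1", "c"],  0.22),   # XORs propagate glitches readily
}
GATE_DELAY = 1.0  # uniform unit delay (assumption)

arrival, glitch_rate = {}, {}
for node, (fanins, toggles) in netlist.items():  # dict order is topological here
    if not fanins:
        arrival[node], glitch_rate[node] = 0.0, 0.0
        continue
    times = [arrival[f] for f in fanins]
    arrival[node] = max(times) + GATE_DELAY
    # Heuristic: input arrival skew scales the chance of a spurious toggle,
    # and glitches generated upstream propagate through this node.
    skew = max(times) - min(times)
    glitch_rate[node] = toggles * skew * 0.5 + sum(glitch_rate[f] for f in fanins)

for node in ("and1", "xor1"):
    f, g = netlist[node][1], glitch_rate[node]
    print(f"{node}: functional {f:.2f}, est. glitch {g:.2f} toggles/cycle "
          f"({100 * g / (f + g):.0f}% of node activity)")
```

Even this toy model reproduces the qualitative picture: nodes fed by paths of unequal depth accumulate spurious transitions, which is why deep, MAC-heavy AI datapaths are so exposed to glitch power.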

5. Final chip-level power signoff

The last step is power and IR drop signoff. The main challenge is the size of the design and the number of cycles to analyze. This problem can be addressed by massively parallelizing the analysis workloads across whatever on-premises and cloud resources are available. Chip-level signoff can be further accelerated by reusing tile-level power analysis results. For IR drop analysis, vectorless techniques can generate stimulus that produces the maximum instantaneous peak power or maximum IR drop.
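A minimal sketch of the parallelization idea: partition a long activity trace into windows and farm the per-window analysis out to a pool of workers, much as a signoff flow might distribute jobs across on-premises or cloud machines. The per-window computation here is a trivial stand-in for real power/IR analysis.

```python
# Minimal sketch: split a long per-cycle trace into windows and analyze
# them concurrently. The per-window "analysis" is a stand-in for a real
# power/IR computation; the trace values are synthetic.
from concurrent.futures import ProcessPoolExecutor

def analyze_window(args):
    """Stand-in for power analysis of one window of the workload."""
    start, trace_slice = args
    return start, sum(trace_slice) / len(trace_slice), max(trace_slice)

def parallel_power(trace, window, workers=8):
    # One job per window; each job is independent and can run anywhere.
    chunks = [(s, trace[s:s + window]) for s in range(0, len(trace), window)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(analyze_window, chunks))
    worst = max(results, key=lambda r: r[2])
    print(f"{len(chunks)} windows analyzed; worst peak {worst[2]:.2f} "
          f"in window starting at cycle {worst[0]}")

if __name__ == "__main__":
    # Synthetic per-cycle power trace (illustrative).
    trace = [1.0 + 0.5 * ((c * 2654435761) % 1000) / 1000
             for c in range(200_000)]
    parallel_power(trace, window=10_000)
```

Because each window is analyzed independently, the same pattern scales from a workstation pool to thousands of cloud cores, and tile-level results can be cached and reused wherever identical tiles see identical activity.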

Conclusion

Optimizing power for modern and future AI hardware must start with understanding the software. A comprehensive design solution for AI power establishes an intrinsic connection with the micro-architecture early in the design process and provides the framework to follow through to design completion and final signoff, minimizing the risk of late-stage surprises.


