Timing Challenges In The Age Of AI Hardware

Large size, physical reuse, and signal propagation behavior pose timing signoff challenges for AI chips.


In recent years, we have seen a clear market trend toward application-specific integrated circuits (ASICs) that are far more efficient in performance and energy consumption than traditional general-purpose computers for processing AI workloads. These AI accelerators harden deep learning algorithm kernels into circuits, enable higher data ingestion bandwidth with local memory, and perform massively parallel computation with numerous cores. Their hugely disruptive potential has triggered explosive growth both in HPC data centers in the cloud and in inferencing applications at the edge, spanning GPUs with tensor units, SoCs with dedicated NPUs, CPUs augmented with FPGA accelerators, and ASIC designs dedicated to AI computation kernels such as TPU, IPU, WSE, etc. [1] [2]

AI chip design poses unique challenges in architectural exploration, power estimation, layout optimization, and more. In this article, we will look at common timing analysis and signoff challenges of AI chips: 1) the raw scale of the designs; 2) the extreme physical reuse of compute cores; 3) the signal propagation behavior in highly regular computation; and 4) the complexities that arise from logic redundancy and low-power techniques.

1. The raw size of AI chips

AI chips – particularly those dedicated to accelerating model training in HPC data centers – are often huge in geometric size compared to traditional chips (e.g., mobile SoCs, networking chips, GPUs, CPUs) [3].


Fig. 1: Scale of dedicated AI hardware

The leading AI chips’ die sizes often approach (or are restricted by) the stepper reticle limits of the manufacturing process. They often exceed 500 million instances at 12nm, more than 1 billion instances at 7nm, and more than 2 billion instances at 5nm. These chips are all implemented with a hierarchical methodology because their scale is far beyond the capacity limits of physical design tools. When it comes to timing analysis and verification, however, designers still prefer a full-chip view to sign off with confidence. The verification of these largest AI chips – almost ironically – requires the most powerful general-purpose computing hardware: 2-4TB of memory, 32-64 cores, and multiple days of turnaround time. To keep this flow bottleneck from jeopardizing design power, performance, and area (PPA) and time-to-market (TTM) requirements, analysis engines must be extremely scalable while making zero QoR compromises.
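A rough back-of-envelope calculation shows why scalability is so demanding: dividing the memory and instance figures quoted above leaves only a couple of kilobytes of budget per instance for a flat full-chip run. The sketch below simply works through that arithmetic using the numbers from this section (they are illustrative figures, not measurements from any specific tool):

```python
# Back-of-envelope estimate of the per-instance memory budget for a flat
# full-chip timing run, using the figures quoted above (not measured data).
instances = 2e9          # >2 billion instances at 5nm (from the text)
memory_bytes = 4e12      # upper end of the 2-4TB range quoted above

per_instance = memory_bytes / instances
print(f"Memory budget per instance: ~{per_instance/1e3:.1f} KB")
# -> roughly 2 KB per instance to hold connectivity, parasitics, timing arcs,
#    and arrival/required times across corners -- which is why the analysis
#    engine must be extremely memory-efficient and scale across 32-64 cores
#    to finish in days rather than weeks.
```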

2. Extreme physical reuse of computing cores

Though very large in scale, AI chips, particularly those for data center applications, are often very regular in architecture, and the majority of the chip is constructed as arrays of compute cores (or tiles). These compute cores are dedicated to specific algorithm kernels and are relatively small (up to a few hundred thousand cells each). When duplicated hundreds or thousands of times, the entire system becomes extremely large. This repetition pattern is often recursive: a few tightly coupled local tiles are grouped to form larger physical blocks (e.g., super-blocks), and these super-blocks are further duplicated to create the entire chip.


Fig. 2: Recursive hierarchical construction of ultra-large AI ASICs

In some recent AI chip architectures, this repetition can cover 80% or even more than 90% of the chip’s logic. These tiles and super-blocks are physical clones of each other and are easily folded together during implementation. Still, in timing verification, due to factors such as process OCV, clock distribution skews and uncertainties, power supply variations, and thermal fluctuations, each instance of a tile needs to be analyzed under its own unique boundary conditions. Yet any optimization (ECO) must be applied commonly to all instances in order to keep them physical clones. This requires a highly intelligent analysis engine capable of managing the multiply instantiated module (MIM) repetition while recognizing the uniqueness of each instance’s boundary conditions.
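One way to picture the bookkeeping involved: each clone shares a single physical implementation, but carries its own boundary timing context, and any ECO is written back once to the shared master so the clones stay identical. The sketch below is a simplified, hypothetical model of that relationship (class and field names are illustrative, not any particular tool’s API):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BoundaryContext:
    """Per-instance boundary conditions (illustrative fields only)."""
    clock_skew_ps: float      # clock arrival skew at the tile boundary
    supply_droop_mv: float    # local IR-drop estimate
    temperature_c: float      # local thermal condition

@dataclass
class TileMaster:
    """One shared physical implementation, instantiated many times (MIM)."""
    name: str
    eco_changes: List[str] = field(default_factory=list)
    instances: Dict[str, BoundaryContext] = field(default_factory=dict)

    def analyze(self) -> Dict[str, float]:
        # Each instance is timed under its own boundary context...
        return {inst: self._time_with_context(ctx)
                for inst, ctx in self.instances.items()}

    def apply_eco(self, change: str) -> None:
        # ...but any fix is recorded once on the master, so every clone
        # receives the same physical change and the instances stay identical.
        self.eco_changes.append(change)

    def _time_with_context(self, ctx: BoundaryContext) -> float:
        # Stand-in for a real timing calculation: worse skew, droop, and
        # temperature all erode the instance's boundary slack.
        return (100.0 - ctx.clock_skew_ps
                - 0.1 * ctx.supply_droop_mv
                - 0.05 * (ctx.temperature_c - 25.0))
```

The point of the structure is the asymmetry: analysis fans out per instance, while optimization converges on the single master.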

3. Signal propagation behavior in pipeline computation

Multi-input switching (MIS) is not a new effect, but it has become critically important to AI chips because of these ASICs’ architectural characteristics and workload patterns. At the application level, AI chips stream large amounts of data across processing tiles to perform massively parallel and very regular computations with extremely high throughput. At the hardware level, this translates into parallel signals propagating through shallow logic on the data paths at very high speed. When different signals switch simultaneously at multi-input gates (AOI, NAND, etc.), the delay of an individual signal can speed up or slow down significantly. This effect can cause serious timing issues, particularly for hold analysis [4]. However, library characterization of multi-input standard cells typically sensitizes one input at a time while assuming the other inputs remain static. Timing analysis engines must therefore detect overlapping signals and invoke additional calculations with the given libraries to account for MIS effects wherever they occur, ensuring safe signoff without incurring massive pessimism.
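A simplified way to see what the engine has to do: compare the arrival windows of a multi-input gate’s inputs and, where they overlap, adjust the single-input characterized delay by an MIS correction. The sketch below is purely illustrative; the window check and the flat derating factor are assumptions standing in for the additional characterization data or simulation a real engine would use:

```python
def arrival_windows_overlap(win_a, win_b):
    """Each window is (earliest, latest) signal arrival time in ps."""
    return win_a[0] <= win_b[1] and win_b[0] <= win_a[1]

def mis_adjusted_delay(char_delay_ps, input_windows, mis_factor=0.85):
    """Apply a multi-input-switching speed-up to a characterized delay.

    char_delay_ps : delay from the library, characterized with one input
                    switching and the others held static.
    input_windows : list of (earliest, latest) arrival windows, one per input.
    mis_factor    : assumed speed-up when inputs switch together
                    (illustrative value only).
    """
    # If any two inputs can switch in overlapping windows, the gate may
    # transition faster than the single-input characterization predicts.
    # Ignoring that speed-up is conservative for setup but optimistic
    # (i.e., unsafe) for hold analysis.
    for i in range(len(input_windows)):
        for j in range(i + 1, len(input_windows)):
            if arrival_windows_overlap(input_windows[i], input_windows[j]):
                return char_delay_ps * mis_factor
    return char_delay_ps

# Example: a NAND2 whose two inputs have overlapping arrival windows.
print(mis_adjusted_delay(20.0, [(100, 110), (105, 115)]))  # -> 17.0 (sped up)
```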


Fig. 3: Increasing MIS effects: a) data Mux (32:1) typical in AI computation b) speed-up effects of typical standard cells

4. Complexities from redundancy and low-power

One of the main motivations for building reconfigurability into AI chips is product yield. Individual AI chip dies at the reticle limit pose significant manufacturing challenges: local defects become frequent enough to call for built-in redundancy to ensure overall functioning products. The redundancy can be in data and clock paths or even at the level of entire tiles/cores, and circuits may be reconfigured via hardware or software controls to bypass defective parts [5]. On the neuromorphic AI computation front, reconfigurability is realized through extensive fine-grained interconnects among different neurons (cores/tiles). These parallel and sideways data and clock paths increase the complexity of static analysis because it is impossible to enumerate all configurations in verification.
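The configuration-space problem is easy to quantify: even a modest number of independently bypassable tiles or redundant paths produces far more reconfigurations than any flow could analyze one by one. A trivial count (the tile numbers are hypothetical, chosen only to show the growth):

```python
# Number of reconfiguration states if each of N tiles can independently be
# active or bypassed (a deliberately simplified model -- real designs
# constrain which combinations are legal, but the growth is still exponential).
for n_tiles in (16, 64, 256):
    print(f"{n_tiles:>4} bypassable tiles -> {2**n_tiles:.3e} configurations")
# ->   16 bypassable tiles -> 6.554e+04 configurations
# ->   64 bypassable tiles -> 1.845e+19 configurations
# ->  256 bypassable tiles -> 1.158e+77 configurations
```

This is why static analysis has to reason about the redundancy structurally rather than by exhaustively verifying every legal configuration.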


Fig. 4: Reconfigurable design through redundancy

On the other hand, low power and low energy are critical for AI inferencing applications. Extensive clock gating, data gating, multiple voltage domains, near-threshold supplies, and IR drop bring additional complexity to timing analysis. The ability to analyze and verify these effects simultaneously with signoff accuracy is critical.
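To see why near-threshold supplies and IR drop interact so strongly with timing, a common first-order picture is the alpha-power-law delay model, in which gate delay grows roughly as Vdd/(Vdd - Vth)^alpha: the same few tens of millivolts of droop that are negligible at nominal voltage become a large delay shift near threshold. The sketch below uses illustrative parameter values, not characterized library data:

```python
# First-order alpha-power-law delay model (Sakurai-Newton style):
#     delay ~ Vdd / (Vdd - Vth)**alpha
# Parameter values below are illustrative assumptions, not from a real library.
VTH = 0.35    # threshold voltage in volts (assumed)
ALPHA = 1.3   # velocity-saturation exponent (assumed)

def relative_delay(vdd):
    return vdd / (vdd - VTH) ** ALPHA

for vdd_nom in (0.9, 0.55):     # nominal vs. near-threshold supply
    droop = 0.05                # 50 mV of IR drop
    d_nom = relative_delay(vdd_nom)
    d_ir = relative_delay(vdd_nom - droop)
    print(f"Vdd={vdd_nom:.2f} V: 50 mV droop slows delay by "
          f"{100 * (d_ir / d_nom - 1):.0f}%")
# At 0.90 V the 50 mV droop costs roughly 7% of delay; at 0.55 V the same
# droop costs over 30%, which is why supply variation and IR drop must be
# analyzed together with timing at signoff.
```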

Conclusion

As process and design technologies advance, AI designs and applications evolve and converge, and we expect new, interesting modeling and analysis problems to continue to arise. Scalable and future-proof timing analysis and signoff solutions need to be designed with AI hardware’s architectural traits in mind. These architecture-aware solutions are vital for tackling both today’s and tomorrow’s challenges, ensuring a risk-free transition from innovative concepts to working systems.


