AI demands push innovation in design architectures and techniques.
This year’s HotChips conference at Stanford was all about artificial intelligence (AI) and machine learning (ML), and what particularly struck me, naturally because we’re in this business too, was how big a role on-chip networks played in some of the leading talks. NVIDIA talked about their scalable mesh architecture, both on-chip and in-package: meshes connecting neural-network processing elements with multicast support at the chip level, and configurable routing connecting those chips at the package level. Huawei talked about their Da Vinci AI cores for both inference and training use cases. Both Xilinx and Intel spoke about their NoC-based systems-on-chip (SoCs). And Google talked about network-based scalability in their TPU v3, allowing them to scale performance linearly (with more boards and racks) to over 100 petaflops.
It was also clear from the conference that AI is driving some huge chips, the largest being Cerebras’ 1.2-trillion-transistor, 46,225 mm² wafer-scale chip, and that interconnect topologies and more distributed approaches to caching are becoming fundamental to making these designs work with acceptable throughput and power. The AI community has been on board with these concepts for quite a while, judging from our work with a number of well-known companies around the world. Arguably, AI demands are pushing innovation in design architectures and techniques harder than any other aspect of digital design today.
Another thing that was clear is that, while some aspects of architecture are common (SIMD/VLIW cores connected by meshes, rings and tori), the variety is still significant, with multiple detailed architectures and scalability objectives ranging from in-accelerator, on-chip and in-package to on-board, rack and beyond. In the mobile world this extends to a lot of experimentation in how much AI processing happens at the edge, in base stations/access points (the fog), and in the cloud/datacenter.
Of course, this is partly about performance but also about power. In the mobile world it’s also about what new capabilities and business opportunities AI can create in the fog, say for network slicing in 5G. Partitioning has become a much more distributed problem, driven by latency, privacy, maintainability, funding the infrastructure, and more. Boundaries will certainly settle at some point, but where is not yet clear.
At each level, on-chip design is heavily driven by performance per milliwatt, requiring significant attention to data locality in AI accelerators: local memory/cache per processing element or tile, global buffers within the accelerator, and even off-die HBM2 memory as faster working storage for big data sets. Accelerators must also often connect coherently to the main SoC, since they share image data with other forms of image processing.
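To make the data-locality point concrete, here is a minimal C sketch of a tiled matrix multiply that stages operands in small scratchpad buffers, standing in for the per-tile SRAM of an accelerator. It is an illustration of the general technique, not any vendor’s actual architecture; the matrix size and tile size are assumptions chosen only for readability.

```c
/* Sketch: tiled matrix multiply with explicit "local buffer" staging.
 * Each block of A and B is fetched once into fast local storage and
 * reused many times before being replaced, which is the essence of
 * data locality in an AI accelerator tile. */
#include <stdio.h>
#include <string.h>

#define N    256   /* matrix dimension (assumed, for illustration) */
#define TILE 32    /* tile edge, sized to fit the "local" SRAM      */

static float A[N][N], B[N][N], C[N][N];

/* Per-tile scratchpads, standing in for memory local to a processing element. */
static float a_buf[TILE][TILE], b_buf[TILE][TILE], c_buf[TILE][TILE];

static void load_tile(float dst[TILE][TILE], const float src[N][N],
                      int row0, int col0)
{
    for (int i = 0; i < TILE; i++)
        memcpy(dst[i], &src[row0 + i][col0], TILE * sizeof(float));
}

int main(void)
{
    /* Deterministic input data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + j);
            B[i][j] = (float)(i - j);
        }

    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE) {
            memset(c_buf, 0, sizeof(c_buf));
            for (int bk = 0; bk < N; bk += TILE) {
                /* One "DMA" of each operand tile into local storage... */
                load_tile(a_buf, A, bi, bk);
                load_tile(b_buf, B, bk, bj);
                /* ...then TILE*TILE*TILE multiply-accumulates of reuse. */
                for (int i = 0; i < TILE; i++)
                    for (int k = 0; k < TILE; k++)
                        for (int j = 0; j < TILE; j++)
                            c_buf[i][j] += a_buf[i][k] * b_buf[k][j];
            }
            for (int i = 0; i < TILE; i++)
                memcpy(&C[bi + i][bj], c_buf[i], TILE * sizeof(float));
        }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}
```

The same pattern scales up the hierarchy: tiles reuse data from their local buffers, the accelerator reuses data from its global buffer, and only when both miss does traffic go out to HBM2.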
All of these chips are getting bigger, so on-chip NoC interconnects become even more important in managing bandwidth effectively (think about those huge video/image streams). And since some of the accelerators are themselves huge, they also depend on networks within the accelerator – not just NoCs but coherent NoCs providing all the cache coherence management between those accelerator caches that you normally associate with CPU clusters.
Another point on caching. This isn’t general-purpose compute; architects and designers know how their algorithms work, so if there are any tricks they can play (prefetch, etc.) at any level of cache, they will, provided that cache is sufficiently configurable to support those tricks.
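As an everyday software analogue of that kind of trick, here is a small C sketch using the GCC/Clang __builtin_prefetch intrinsic to pull data toward the cache a fixed distance ahead of use in a streaming loop. The prefetch distance is an assumption, not a tuned value; the point is simply that known access patterns let you hide memory latency.

```c
/* Sketch: software prefetching in a streaming reduction.
 * Knowing the access pattern (sequential reads) lets us request the
 * next chunk before we need it, roughly what an accelerator designer
 * bakes into a configurable prefetch engine. */
#include <stdio.h>
#include <stdlib.h>

#define LEN            (1 << 20)
#define PREFETCH_AHEAD 16   /* elements ahead; assumed, tune per platform */

int main(void)
{
    float *data = malloc(LEN * sizeof(float));
    if (!data)
        return 1;
    for (size_t i = 0; i < LEN; i++)
        data[i] = (float)i * 0.5f;

    float sum = 0.0f;
    for (size_t i = 0; i < LEN; i++) {
        /* Hint the upcoming element into cache (GCC/Clang builtin:
         * address, 0 = read, low temporal locality). */
        if (i + PREFETCH_AHEAD < LEN)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
        sum += data[i];
    }

    printf("sum = %f\n", sum);
    free(data);
    return 0;
}
```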
Finally, as if all this weren’t complex enough, many of these devices go into safety-critical applications like your car. Here there are two aspects to safety: the functional safety considerations covered by ISO 26262, which have been pretty well characterized and adopted by the industry, and the system-level considerations, which are much more complex – such as the system correctly recognizing a pedestrian or a traffic light or whatever in an acceptably high number of cases.
The ISO/PAS 21448 SOTIF spec aims to give guidance in this area. I have also seen work being done on the UL 4600 spec (led by CMU autonomous driving researcher Phil Koopman), and by the IEEE P2020 working group, which aims to define a set of key performance indicators for imaging reliability in areas such as LED flicker susceptibility and contrast detection probability. There’s probably a lot of room to further pin down and quantify what we mean by safety at higher-than-the-chip levels.
Meanwhile, giant leaps are being made in supporting new AI architectures, tuning them for optimum performance per milliwatt and embedding them effectively into traditional and novel SoC architectures. You can learn more by reading my white paper.