IBM’s Energy-Efficient NorthPole AI Unit

No central memory, no off-chip memory, no von Neumann bottleneck.

It is well known by now that, from an energy-efficiency standpoint, the biggest bang for the buck is found at the highest levels of abstraction. Fitting the right architecture to the task at hand, i.e., an application-specific architecture, yields benefits that are hard or impossible to claw back later in the design and implementation flow. With the surge of interest in running AI models, the general-purpose CPU, while quite flexible, has not been able to keep up with GPU and TPU machines. At last month’s Hot Chips 2023 conference, Dr. Dharmendra S. Modha, an IBM Fellow at IBM’s Almaden Research Center, presented IBM’s NorthPole Neural Inference Machine. Its architecture is inspired by that of the human brain.

This work traces its origins back to 2004, with the original project concept, and to a progression of research that has picked up numerous grants and awards, including the ACM Gordon Bell Prize in 2009 and induction into the Computer History Museum in 2016. From 2018 onward the project was publicly quiet, but a lot of research was still being carried out.

NorthPole is a core-based architecture with the cores tiled in 2 dimensions. Inside the NorthPole core is a Vector Matrix Multiplication Engine capable of 8-, 4- and 2-bit precision operations at 2048, 4096 and 8192 Ops/cycle respectively.  It can also provide mixed precision or the “right” precision needed for each layer. The output feeds a Vector Compute Unit capable of 256 FP16 ops/cycle that feeds an “Activation” Function Unit capable of 32 FP16 ops/cycle. The operation is fully pipelined.
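The per-core figures above follow a simple pattern: each halving of operand width doubles the operation count, so operand bits times ops/cycle stays constant across the three quoted modes. The Python sketch below restates that arithmetic and, purely as an illustration, totals the VMM cycles for a made-up three-layer network that uses a different precision per layer. The function names, the layer list and the operation counts are assumptions, not figures from the presentation, and the sketch ignores the vector and activation units, data movement and scheduling, so it is not a performance model of the chip.

```python
# Per-core VMM throughput quoted in the article (ops per cycle by operand width).
VMM_OPS_PER_CYCLE = {8: 2048, 4: 4096, 2: 8192}


def bit_ops_per_cycle(bits: int) -> int:
    """Operand width times ops/cycle for one of the quoted precision modes."""
    return bits * VMM_OPS_PER_CYCLE[bits]


def vmm_cycles(ops: int, bits: int) -> int:
    """Idealized cycles for one core to issue `ops` operations at `bits` precision."""
    per_cycle = VMM_OPS_PER_CYCLE[bits]
    return -(-ops // per_cycle)  # ceiling division


# Hypothetical layers (illustrative only): (name, operation count, precision in bits).
layers = [("conv1", 1_000_000, 8), ("conv2", 4_000_000, 4), ("fc", 500_000, 2)]

total_cycles = sum(vmm_cycles(ops, bits) for _, ops, bits in layers)

print({bits: bit_ops_per_cycle(bits) for bits in VMM_OPS_PER_CYCLE})
print("idealized single-core VMM cycles:", total_cycles)
```

Running a layer at 4-bit rather than 8-bit precision halves its idealized cycle count in this toy model, which is the appeal of picking the “right” precision per layer.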

Moving data can be very costly from an energy standpoint. Depending on the application and architecture, more energy can be spent on moving data than on the actual computation in the workload. NorthPole therefore couples memory closely with compute, placing it within only a few microns of the logic. The VMM and Vector Unit both have their own private memories. There’s also 768KB/core of unified memory for storing weights, programs, and neural activations.

Control consists of 8 independent threads per core, synchronized by construction. There is no data-dependent conditional branching, and there are no misses, no stalls and no speculative execution. Every joule of energy used in the system is targeted at the computation needed for the result while minimizing the movement of data. Figure 1 below shows the core makeup of the VMM, VCU and memory, and how the memory is intertwined with the compute logic in the layout.

Figure 1. Brain Inspired Architecture and Layout

To connect the fragmented memory in the layout, 4 NoCs are used. The Activation NoC (ANoC) is inspired by the long white-matter pathways in the brain, and the Partial Sum NoC (PNoC) by the shorter grey-matter pathways. This is shown in the upper right-hand corner of Figure 1 above.

The silicon-optimized NoCs (3 and 4 in the above figure) are the Model NoC (MNoC) for moving weights around the chip and the Instruction NoC (INoC) for moving programs around the chip. This architecture enables the reconfigurability needed to marry the brain-inspired architecture to a silicon implementation, where the brain operates at ~10Hz while silicon can run up to ~5GHz. The NoC implementation means that there are 4096 wires crossing each core in both dimensions.

Figure 2 below shows the chip layout for an implementation in a 12nm technology. It’s arranged as a 16×16 array of cores with 192MB of memory and a 32MB framebuffer for IO tensors. The chip is 800mm² with 22 billion transistors and was fully functional at first silicon.
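The chip-level numbers can be cross-checked against the per-core figures quoted earlier in the article. The Python sketch below does so under two assumptions that the article does not state outright: that the 192MB is the sum of the per-core unified memories, and that 1MB is counted as 1024KB. The variable names are illustrative, and the chip-wide ops/cycle figure is simply the per-core 8-bit number multiplied by the core count, not a measured value.

```python
# Figures quoted in the article.
CORES = 16 * 16                 # 16x16 array of cores
UNIFIED_KB_PER_CORE = 768       # unified memory per core
VMM_OPS_PER_CYCLE_8BIT = 2048   # per-core VMM throughput at 8-bit precision
DIE_AREA_MM2 = 800
TRANSISTORS = 22_000_000_000

# Assumption: total on-chip memory is the sum of per-core unified memory,
# with 1 MB = 1024 KB.
unified_mb = CORES * UNIFIED_KB_PER_CORE / 1024

# Naive aggregate: every core's VMM issuing at the quoted 8-bit rate.
chip_vmm_ops_per_cycle = CORES * VMM_OPS_PER_CYCLE_8BIT

# Average transistor density in millions of transistors per mm^2.
density_m_per_mm2 = TRANSISTORS / DIE_AREA_MM2 / 1e6

print(f"cores: {CORES}")
print(f"unified memory: {unified_mb:.0f} MB")
print(f"aggregate 8-bit VMM ops/cycle: {chip_vmm_ops_per_cycle}")
print(f"transistor density: {density_m_per_mm2:.1f} M/mm^2")
```

Under those assumptions, 256 cores at 768KB each come to exactly the 192MB quoted for the chip.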

Figure 2. 12nm Implementation Layout

NorthPole is a unique architecture in that it has no centralized memory, no off-chip memory and no von Neumann bottleneck. Dr. Modha claimed that many of today’s AI architectures are actually layer accelerators: each layer must be loaded onto the accelerator, and its results then written back out to off-chip memory or cache. NorthPole, by contrast, is a network computer. The entire network is loaded onto the chip, and inference is then a simple 3-step process: write tensor, run, get tensor. In essence, the chip behaves as an active memory. Figure 3 shows the difference between the layer accelerator approach and the NorthPole architecture in performing AI calculations.
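The contrast between the two approaches can be sketched as a toy model in Python. Everything here is hypothetical: the class names, the method names and the transfer counter are illustrative inventions, not IBM’s software interface, and the “layers” are plain Python functions standing in for real network layers. The only point is the shape of the flow: the layer accelerator touches off-chip storage around every layer, while the network computer keeps the whole network resident and exposes only write tensor, run, get tensor.

```python
class LayerAccelerator:
    """Toy model: every layer is loaded in and its result written back out."""

    def __init__(self):
        self.layer_transfers = 0  # off-chip round trips for layers/intermediates

    def infer(self, layers, x):
        for layer in layers:
            self.layer_transfers += 1  # load the layer onto the accelerator
            x = layer(x)
            self.layer_transfers += 1  # write the result back to off-chip memory
        return x


class NetworkComputer:
    """Toy model: the whole network is resident; only IO tensors cross the boundary."""

    def __init__(self, layers):
        self.layers = list(layers)  # loaded once, stays on chip
        self.layer_transfers = 0    # no per-layer off-chip traffic in this model
        self._input = None
        self._output = None

    def write_tensor(self, x):
        self._input = x

    def run(self):
        x = self._input
        for layer in self.layers:
            x = layer(x)  # intermediates never leave the chip
        self._output = x

    def get_tensor(self):
        return self._output


# Stand-in "layers": scale, bias, ReLU (illustrative only).
layers = [
    lambda v: [2 * e for e in v],
    lambda v: [e + 1 for e in v],
    lambda v: [max(e, 0) for e in v],
]

accel = LayerAccelerator()
accel_result = accel.infer(layers, [1, -2, 3])

chip = NetworkComputer(layers)
chip.write_tensor([1, -2, 3])
chip.run()
chip_result = chip.get_tensor()

print(accel_result, accel.layer_transfers)
print(chip_result, chip.layer_transfers)
```

Both models compute the same result; they differ only in how often layer data crosses the chip boundary, which is where the article locates the energy cost.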

Figure 3. Architectural Compute Comparisons

Figure 4 below compares NorthPole with other popular compute architectures and implementations in terms of the separation between memory and compute logic. While the layouts of the other implementations show clear separations between compute and memory, the memory and logic are visibly more tightly intertwined in NorthPole, which should significantly improve its energy efficiency.

Figure 4. Layout Comparison of Compute and Memory for Different Architectural Styles

So how well does NorthPole perform compared with other architectures on AI benchmarks? From a latency and energy-efficiency standpoint, quite well. Figures 5 and 6 compare NorthPole with other popular products on ResNet-50, and Figures 7 and 8 show similar comparisons for Yolo-v4. NorthPole is implemented in 12nm technology, yet it significantly outperforms other designs implemented in 4nm technology. Compared with a 12nm GPU implementation, NorthPole has ~2500% higher energy efficiency. Dr. Modha said that it’s an example of how “Architecture trumps Moore’s Law” and that there’s still significant room for scaling improvements with NorthPole. Figure 9 shows a diagram and information on scaling out to server applications. The presentation also included information on AI algorithms and software support for NorthPole. In all, a very impressive accomplishment by the IBM team.

Figure 5. ResNet-50 Efficiency Comparison

Figure 6. ResNet-50 Throughput and Latency Comparison

Figure 7. Yolo-v4 Efficiency Comparison

Figure 8. Yolo-v4 Throughput and Latency Comparison

Figure 9. Prototype NorthPole Scale-Out Assemblies in a Server
