Memory Architectures In AI: One Size Doesn’t Fit All

Comparing different machine learning use-cases and the architectures being used to address them.

In the world of regular computing, we are used to certain ways of architecting for memory access to meet latency, bandwidth and power goals. These have evolved over many years to give us the multiple layers of caching and hardware cache-coherency management schemes which are now so familiar. Machine learning (ML) has introduced new complications in this area for multiple reasons: AI/ML chips can differ significantly in power demand, in latency requirements, in whether they serve dedicated safety-critical applications, and in the layout challenges of high-performance designs, and those differences cut both between different classes of edge application and between the edge and the datacenter.

A good way to contrast these differing needs is to compare use cases, and the architectures used to address them, for Movidius, Mobileye and a datacenter application such as the Baidu Kunlun family. (Baidu hasn’t released architecture details, but I know they use our AI package, which reveals important characteristics of their architecture.)

Example #1: Movidius Myriad X
First, consider sample use cases for the Movidius Neural Compute Stick: detecting harmful bacteria in water, screening for skin cancer or scanning for illegal images of children on the Internet. All are highly valuable applications demanding good (even real-time) performance, but none is latency-critical. Running on a battery is useful in some contexts, though not always essential. The architecture is general-purpose (for AI applications), based on multiple SHAVE SIMD processors. You could think of these as a scalable set of accelerators, each with its own local memory abutted to the logic. Obviously, these have to communicate with a common pool of memory to share data (they’re all working on the same image), but rather than use cache coherency between accelerator cores, the architecture implements a lightweight mailbox / message-queuing scheme for core-to-core communication, along with a shared L2 cache located in front of the DDR memory controller. External memory is generally going to be some flavor of LPDDR4. [Picture source: Intel / Movidius]

Movidius on-chip accelerators communicate directly with each other and with the main memory space using their on-chip “Intelligent Memory Fabric” and 128-bit AMBA AXI communications. The FlexNoC interconnect IP is a perfect solution for this type of non-coherent communication.
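
To make the mailbox idea concrete, here is a toy single-producer / single-consumer message queue in C. The names and layout are my own illustrative assumptions, not the Movidius Intelligent Memory Fabric API; the point is simply that cores can hand work descriptors to each other through shared memory without relying on hardware cache coherency.

```c
/* Toy single-producer / single-consumer mailbox between two accelerator
 * cores sharing a common memory pool. Hypothetical sketch only -- the
 * names and layout are illustrative, not the Movidius "Intelligent
 * Memory Fabric" API. */
#include <stdatomic.h>
#include <stdint.h>

#define MBOX_DEPTH 8   /* power of two so we can mask instead of mod */

typedef struct {
    uint32_t buf_offset;  /* offset of the work buffer in shared memory */
    uint32_t length;      /* number of bytes the consumer should process */
} msg_t;

typedef struct {
    msg_t slots[MBOX_DEPTH];
    _Atomic uint32_t head;   /* written only by the producer core */
    _Atomic uint32_t tail;   /* written only by the consumer core */
} mailbox_t;

/* Producer core: post a message describing work placed in shared memory. */
static int mbox_post(mailbox_t *mb, msg_t m)
{
    uint32_t head = atomic_load_explicit(&mb->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&mb->tail, memory_order_acquire);
    if (head - tail == MBOX_DEPTH)
        return -1;                        /* mailbox full */
    mb->slots[head & (MBOX_DEPTH - 1)] = m;
    /* release: the slot contents become visible before the new head value */
    atomic_store_explicit(&mb->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer core: poll for the next pending message. */
static int mbox_poll(mailbox_t *mb, msg_t *out)
{
    uint32_t tail = atomic_load_explicit(&mb->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&mb->head, memory_order_acquire);
    if (head == tail)
        return -1;                        /* nothing pending */
    *out = mb->slots[tail & (MBOX_DEPTH - 1)];
    atomic_store_explicit(&mb->tail, tail + 1, memory_order_release);
    return 0;
}
```

In a real device the “message” is just a small descriptor; the bulk image data stays in the shared L2 / DDR space that both cores can already see.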

Example #2: Mobileye EyeQ
Now consider Mobileye use cases. If you know anything about Mobileye, you know they are pioneers in automotive ADAS. If you’re a car geek, you also know that they are making a play, with their EyeQ5 design, for the central sensor-fusion and planning brain in the car, alongside their established vision-processing role. Here, latency is much more important, as is power. We might think that a great big 12V battery (or many batteries in an EV) makes this a non-problem, but not so: electronics power demand and dissipation / cooling requirements in modern cars are a real concern. Mobileye software is much more highly tuned to specific ADAS applications, and the resulting performance is much higher (24 TOPS for EyeQ5 versus around 4 TOPS for the Movidius Myriad X). With Mobileye, software is packaged with the hardware, whereas with Movidius you can develop your own software. Because the hardware is customized for the use cases and software, the Mobileye architecture also implements more types of accelerators than Movidius, for dedicated vision-related, AI and algorithmic functions. Processing elements may each be supported by tightly-coupled memories embedded in those accelerators, as well as a very complex memory architecture with multiple cache-coherent “islands” and built-in hardware functional-safety mechanisms (learn more here).

Mobileye-style chips, as more complex systems (more accelerators and more types of accelerators), will support more than one level of internal connectivity and caching. Efficient memory access must be managed first locally and then back into the system domain, which requires caching and hardware coherency management through the interconnect at each level. Here, our Ncore interconnect and proxy caches can be used both inside the accelerator and in connecting to the system domain. [Picture sources: Mobileye / Intel]

Example #3: Datacenter
For datacenter applications, performance is everything and power is not as big a concern as at the “edge.” For example, a liquid-cooled Google TPUv3 consumes around 200 W, while an EyeQ5 consumes 10 W. The current trend for datacenter AI chips is flexibility: you need to handle training as well as inference, and you need to be multi-purpose, running multiple types of neural-net algorithms. That means full word-length, floating-point processing, whereas at the edge you want fixed-point arithmetic and short word-lengths to minimize power. Getting the ultimate in performance leads to mesh or ring architectures of homogeneous processing elements or subsystems, usually with smaller closely-coupled memories or caches associated with each “node.” As these “regular topology” designs scale up in size, architects often organize multiple processing elements into “tiles,” with each tile having its own shared memory or cache. Tiling often helps reduce complexity in both software development and back-end layout of the chip, and the memory architecture within a tile can be non-coherent or cache coherent. [Picture source: Google]
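
To see why short word-lengths matter at the edge, here is a minimal sketch in C of symmetric int8 quantization with a single per-tensor scale. This is a generic textbook scheme of my own choosing, not any particular vendor’s number format; it simply shows how float activations collapse to 8-bit fixed-point values that are far cheaper to move and multiply than full-width floats.

```c
/* Toy symmetric int8 quantization: illustrates why edge inference can use
 * short fixed-point word lengths while datacenter training keeps floats.
 * Purely illustrative -- not any specific chip's number format. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Quantize a float tensor to int8 using one per-tensor scale factor. */
static float quantize_int8(const float *in, int8_t *out, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(in[i]) > max_abs)
            max_abs = fabsf(in[i]);
    /* map [-max_abs, +max_abs] onto [-127, +127]; guard the all-zero case */
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++)
        out[i] = (int8_t)lrintf(in[i] / scale);
    return scale;                 /* keep the scale to dequantize results later */
}

int main(void)
{
    float act[4] = {0.02f, -1.30f, 0.75f, 0.40f};
    int8_t q[4];
    float scale = quantize_int8(act, q, 4);
    for (int i = 0; i < 4; i++)
        printf("%.2f -> %4d (dequantized %.3f)\n", act[i], q[i], q[i] * scale);
    return 0;
}
```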

AI platforms: Scaling up (and down)
The architecture of AI chips is becoming part of a “platform strategy,” in which it is common to hierarchically scale up the number of accelerators in higher-performance product versions, which in turn requires multiple proxy-cache access points into the coherent system. Ncore is architected to easily support this scale-up (or scale-down), minimizing the need for major redesign on product derivatives and follow-ons. This also makes it easy to support multiple islands of accelerators plus proxy caches, allowing for more aggressive power management without compromising performance when it is needed.

The Ncore cache coherent interconnect also supports its own last-level cache (of which, again, you can have more than one), coherent with the rest of the system. So the architect has full support to minimize main-memory accesses through caching, minimizing power and optimizing latency and bandwidth in her AI accelerators. Wait, can’t I do all of this using other cache coherent interconnect IP? Not so fast. Most cache coherent interconnects are designed for physically tightly-coupled systems (like closely-placed CPU clusters); great for those systems, but wildly inefficient for cross-SoC communication in a highly distributed architecture with heterogeneous processing clusters located all over the chip. That’s why everyone designing reasonably sized SoCs uses NoCs. Another issue is managing communication between non-coherent accelerators and the cache coherent subsystem. The proxy caches accomplish this, in essence allowing clusters of non-coherent hardware accelerators to participate as full coherent citizens in the cache coherent domain alongside traditional CPU and GPU clusters.
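
As a rough mental model of what a proxy cache does, consider the toy simulation below, written in C under my own simplifying assumptions (a one-word “line,” a single proxy, no dirty state). It is emphatically not the Ncore protocol; it only illustrates the contract: the accelerator reads through the proxy, and a coherent write elsewhere invalidates the proxy’s copy so stale data is never returned.

```c
/* Toy model of a proxy cache fronting a non-coherent accelerator cluster.
 * A coherent write elsewhere in the system invalidates the proxy's copy,
 * so the accelerator's next read refetches current data. Illustrative
 * MSI-style simplification only -- not the Ncore coherency protocol. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NLINES 4

typedef struct {
    uint32_t mem[NLINES];          /* backing memory, one word per "line"  */
    bool     proxy_valid[NLINES];  /* does the proxy hold a clean copy?    */
    uint32_t proxy_data[NLINES];
} coh_system_t;

/* Accelerator read goes through the proxy cache. */
static uint32_t proxy_read(coh_system_t *s, int line)
{
    if (!s->proxy_valid[line]) {           /* miss: fetch from coherent memory */
        s->proxy_data[line] = s->mem[line];
        s->proxy_valid[line] = true;
        printf("proxy miss on line %d -> fetched %u\n", line, s->proxy_data[line]);
    } else {
        printf("proxy hit  on line %d -> %u\n", line, s->proxy_data[line]);
    }
    return s->proxy_data[line];
}

/* Coherent CPU write: update memory and snoop-invalidate the proxy copy. */
static void cpu_write(coh_system_t *s, int line, uint32_t value)
{
    s->mem[line] = value;
    s->proxy_valid[line] = false;          /* snoop invalidates the proxy */
    printf("cpu write  on line %d =  %u (proxy copy invalidated)\n", line, value);
}

int main(void)
{
    coh_system_t s = {0};
    s.mem[1] = 10;
    proxy_read(&s, 1);     /* miss, fetches 10                         */
    proxy_read(&s, 1);     /* hit, still 10                            */
    cpu_write(&s, 1, 42);  /* CPU updates the line, proxy invalidated  */
    proxy_read(&s, 1);     /* miss again, fetches 42: no stale data    */
    return 0;
}
```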

HBM2, Chiplets and CCIX
Datacenter accelerators, given their need for speed, want to work with dedicated main memory for performance. This might be an HBM2 stack, though Chinese design teams seem to prefer GDDR6 for its lower cost and availability. Given the spatial structure of the mesh, it is therefore necessary to place memory access interfaces around the structure. That creates a new problem: connecting all of those interfaces, scattered around a huge mesh, to the ultimate HBM2 controller without compromising latency and bandwidth.

We handle this (and other capabilities) through our FlexNoC 4 AI package. The built-in HBM2 interleaving supports up to 16 channels between targets and initiators, up to 1024 bits wide, managing traffic aggregation, data-width conversion and response reordering into the HBM2 controller. However, it’s worth pointing out that these meshes are already so big that they are bumping into reticle-size limits, and they’re still not as big as the design teams want to go. I’m hearing a lot of interest in multi-die meshes / chiplets to get past this problem. Of course, those will have to be coherent, which means they’ll have to connect through coherent inter-chip interfaces such as CCIX.
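
For intuition on what channel interleaving buys you, here is a small C sketch that spreads consecutive address blocks round-robin across HBM2 channels. The channel count and interleave granularity are illustrative assumptions of mine, not the FlexNoC 4 configuration; the idea is just that streaming accesses from the mesh land evenly on all of the memory channels instead of hammering one of them.

```c
/* Toy address interleaving across HBM2 channels: consecutive
 * interleave-sized blocks of the address space are assigned round-robin
 * to channels so that streaming traffic loads them evenly.
 * Parameter choices are illustrative, not the FlexNoC 4 configuration. */
#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS      16      /* power of two simplifies the mapping  */
#define INTERLEAVE_BYTES  256     /* granularity of the round-robin split */

typedef struct {
    unsigned channel;   /* which HBM2 channel services this address        */
    uint64_t offset;    /* address as seen by that channel's controller    */
} route_t;

static route_t route_address(uint64_t addr)
{
    uint64_t block = addr / INTERLEAVE_BYTES;
    route_t r;
    r.channel = (unsigned)(block % NUM_CHANNELS);
    /* Strip the channel-select bits so each controller sees a dense space. */
    r.offset  = (block / NUM_CHANNELS) * INTERLEAVE_BYTES
              + (addr % INTERLEAVE_BYTES);
    return r;
}

int main(void)
{
    for (uint64_t addr = 0; addr < 8 * INTERLEAVE_BYTES; addr += INTERLEAVE_BYTES) {
        route_t r = route_address(addr);
        printf("addr 0x%05llx -> channel %2u, offset 0x%05llx\n",
               (unsigned long long)addr, r.channel, (unsigned long long)r.offset);
    }
    return 0;
}
```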

“Systems companies” are making AI chips
One new challenge I often see is that many of these complex AI chips are being designed by very large “systems companies.” These are not traditional semiconductor companies, so they generally go to third parties for physical implementation. That creates some interesting problems. As a system designer architecting these complex structures at the logical / RTL level, how do you optimally manage latency and bandwidth trade-offs, and how do you ensure the outsourced back-end team doesn’t negate the advantage you thought you would get because they implemented things “wrong” from a physical standpoint? This is less of a problem for traditional semiconductor vendors who do front-end and back-end design in-house, but it may entail more hand-holding than the system houses expected.

Sources:
Mobileye CES 2019 presentation, “The State of AV/ADAS at Mobileye/Intel”: https://s21.q4cdn.com/600692695/files/doc_presentations/2019/01/Mobileye_CES2019.pdf
Intel / Movidius Myriad X Product Brief: https://newsroom.intel.com/wp-content/uploads/sites/11/2017/08/movidius-myriad-xvpu-product-brief.pdf
Intel and Mobileye Autonomous Driving Solutions Product Brief: https://newsroom.intel.com/wp-content/uploads/sites/11/2018/06/intel-mobileye-pb.pdf
“Tearing Apart Google’s TPU 3.0 AI Coprocessor”: https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/
“SHAVE v2.0 – Microarchitectures – Movidius”: https://en.wikichip.org/wiki/movidius/microarchitectures/shave_v2.0
