SRAM In AI: The Future Of Memory

Why SRAM is viewed as a critical element in new and traditional compute architectures.


Experts at the Table — Part 1: Semiconductor Engineering sat down to talk about AI and the latest issues in SRAM with Tony Chan Carusone, CTO at Alphawave Semi; Steve Roddy, chief marketing officer at Quadric; and Jongsin Yun, memory technologist at Siemens EDA. What follows are excerpts of that conversation. Part two of this conversation can be found here and part three is here.


[L-R]: Alphawave’s Chan Carusone; Quadric’s Roddy; Siemens’ Yun.

SE: What are the key characteristics of SRAM that will make it suitable for AI workloads?

Yun: SRAM is compatible with the CMOS logic process, which allows SRAM to track logic performance improvements whenever it migrates from one technology node to another. SRAM is a locally available memory within a chip, so it offers instantly accessible data, which is why it is favored in AI applications. With decades of manufacturing experience, we know most of its potential issues and how to maximize its benefits. In terms of performance, SRAM stands out as the highest-performing memory solution that we know so far, making it the preferred choice for AI.

Roddy: The amount of SRAM, which is a critical element of any AI processing solution, will depend greatly on whether you’re talking about data center versus device, or training versus inference. But I can’t think of any of those applications where you don’t have at least a substantial amount of SRAM right next to the processing element running the AI training or inference. Any kind of processor needs some form of SRAM for scratchpads, for local memories, for storing intermediate results. It doesn’t matter whether you’re talking about an SoC that has a reasonable amount of SRAM on chip next to the compute engine, and you go off-chip to something like DDR or HBM to hold the bulk of the model, or whether you’re talking about a giant training chip, dense with hundreds of megabytes of SRAM. In either case you need to have good, fast SRAM immediately next to the big array of multiply-accumulate units that do the actual computation. That’s just a fact of life, and the rest of the question is a balancing act. What kind of models are going to run? Is the model going to be big or small? Is this high-performance ML or low-performance, always-on ML? Then it becomes a question of where the bulk of the activations in the model reside during inference or training. There’s always SRAM somewhere. It becomes just an architectural tradeoff question based on the particulars.
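
As a rough illustration of that balancing act, the sketch below (with made-up model sizes and SRAM capacities, not figures from any real chip) checks whether a model’s weights can stay resident in on-chip SRAM or whether the bulk of the model has to live off-chip in DDR or HBM and be streamed through the SRAM.

```python
# Back-of-envelope sketch of the on-chip vs. off-chip balancing act described
# above. All capacities and model sizes are illustrative assumptions, not
# figures from any real chip or product.

def placement(params: float, bytes_per_param: float, on_chip_sram_mb: float) -> str:
    """Report whether the weights fit in on-chip SRAM or must be streamed."""
    weight_bytes = params * bytes_per_param
    sram_bytes = on_chip_sram_mb * 1e6
    if weight_bytes <= sram_bytes:
        return f"{weight_bytes / 1e6:.0f} MB of weights fit in {on_chip_sram_mb} MB of SRAM"
    return (f"{weight_bytes / 1e9:.1f} GB of weights exceed {on_chip_sram_mb} MB of SRAM; "
            "bulk of the model lives in DDR/HBM and streams through SRAM")

# A small always-on model vs. a large language model, both hypothetical:
print(placement(5e6, 1, 8))     # 5M int8 parameters, 8 MB of on-chip SRAM
print(placement(7e9, 2, 256))   # 7B fp16 parameters, 256 MB of on-chip SRAM
```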

Chan Carusone: SRAM is essential for AI, and embedded SRAM in particular. It’s the highest-performance, and you can integrate it directly alongside the high-density logic. For those reasons alone, it’s important. Logic is scaling better than SRAM. As a result, SRAM becomes more important and is consuming a larger fraction of that chip area. Some processors have a large amount of SRAM on them, and the trend might be for that to continue, which starts becoming a significant cost driver for the whole thing. We want to integrate as much compute onto these high-performance training engines as possible. It’ll be interesting to see how that’s dealt with as we go forward. One thing you see emerging is a disaggregation of these large chips that are reaching reticle limits into multiple chiplets, with proper interconnects that allow them to behave as one large die, and thereby integrate more compute and more SRAM. In turn, the large amount of SRAM is driving this transition to chiplet-based implementations even more.

Roddy: Whether it’s the data center or a two-dollar edge device, machine learning is a memory management problem. It’s not a compute problem. At the end of the day, you’ve either got massive training sets and you’re trying to shuffle that off chip and on chip back and forth, all day long, or you’re iterating through an inference, where you’ve got a bunch of weights and you’ve got activations flowing through. All the architectural differences between different flavors of compute implementation boil down to different strategies to manage the memory and manage the flow of the weights and activations, which is incredibly dependent upon the type of memory available and chosen. Any chip architect is effectively mapping out a memory hierarchy appropriate to their deployment scenario, but in any scenario, you have to have SRAM.
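
A back-of-envelope way to see why this is a memory-management rather than a compute problem is to compare the time spent streaming weights with the time spent doing the math. The peak compute and DRAM bandwidth figures below are assumptions chosen only to illustrate the shape of the tradeoff.

```python
# Rough roofline-style sketch of why inference is often bandwidth-bound rather
# than compute-bound. The bandwidth and compute figures are assumed, not measured.

def bound_by(weight_bytes: float, macs: float,
             peak_macs_per_s: float, dram_bytes_per_s: float) -> str:
    """Compare the time to stream the weights with the time to do the math."""
    t_compute = macs / peak_macs_per_s
    t_memory = weight_bytes / dram_bytes_per_s
    limiter = "memory" if t_memory > t_compute else "compute"
    return f"compute {t_compute * 1e3:.2f} ms, memory {t_memory * 1e3:.2f} ms -> {limiter}-bound"

# One decode step of a hypothetical 7B-parameter int8 model at batch size 1:
# every weight byte is touched once and contributes one multiply-accumulate.
weights = 7e9   # bytes
macs = 7e9      # multiply-accumulates
print(bound_by(weights, macs, peak_macs_per_s=100e12, dram_bytes_per_s=100e9))
```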

SE: Will memory architectures evolve as the adoption of CXL expands?

Chan Carusone: There’s a family of new technologies that might give new optimization opportunities for computer architects. CXL might be one. Another is HBM, which could allow for dense, integrated DRAM stacks. There may be implementations, including chiplet-based architectures, as EDA tools and IP become more available to enable those types of solutions. There are all kinds of new knobs available to architects that might allow for a mix of different memory technologies for different levels of cache. That’s creating good opportunities for customization of hardware solutions to particular workloads, without requiring a completely new, from-scratch design.

Yun: CXL is like an evolved version of PCI Express. It offers high-speed communication between devices like CPUs, GPUs, and memory devices, and it supports cache coherency, so devices can communicate and share memory with one another. Building on this, Samsung recently proposed near-memory computation within DRAM, which may fill in some of the memory hierarchy beyond the L3 cache level and the main memory level.

Roddy: We’re getting a wider dynamic range of model size now compared to, say, four years ago. The large language models (LLMs), which have been in the data center for a couple of years, are starting to migrate to the edge. You’re seeing people talking about running a 7-billion parameter model on a laptop. In that case, you’d want generative capability baked into your Microsoft products. For example, when you’re stuck on an airplane, you can’t go to the cloud, but you want to be able to run a big model. That wasn’t the case 2 to 4 years ago, and even the models people ran in the cloud weren’t as large as these 70 billion- to 100 billion-parameter models.
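
For context on why those model sizes strain edge devices, here is a small sketch of the weight footprint at different precisions. It deliberately ignores activations, KV caches, and runtime overhead, and the parameter counts are just the round numbers mentioned above.

```python
# Memory footprint of a model's weights at different precisions. Only parameter
# count and bits per parameter are considered; everything else is ignored.

def weight_footprint_gb(params: float, bits_per_param: int) -> float:
    return params * bits_per_param / 8 / 1e9

for params in (7e9, 70e9):
    for bits in (16, 8, 4):
        gb = weight_footprint_gb(params, bits)
        print(f"{params / 1e9:.0f}B params @ {bits}-bit: {gb:.1f} GB")
```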

SE: What’s the impact of that?

Roddy: It has a dramatic effect on both the total amount of memory in the system and the strategies for staging both the weights and activations at the “front door” of the processing element. For example, in the device area where we work, there’s much more integration of larger SRAMs on-device or on-chip. And then with the interfaces, whether it be DDR, HBM, or something like CXL, people try to figure out, “Okay, I’ve got cold storage because I’ve got my 10-billion-parameter model in flash somewhere, along with all the other elements in my high-end phone. I’ve got to pull it out of cold storage, put it into ‘warm storage’ off-chip, DDR or HBM, and then I have to quickly move data on and off chip into the SRAM, which is next to my compute element, whether it’s our chip, NVIDIA’s, whatever.” That same hierarchy has to exist. So the speed and power of those interfaces, and the signaling strategies, become critical factors in the overall power and performance of the system. A few years ago, people were looking at efficiency in machine learning as a hardware problem. Nowadays, it’s more of an offline, ahead-of-time compilation software problem. How do I look at this massive model that I’m going to sequence through multiple times, either training or inference, and how do I sequence the tensors and the data in the smartest way possible to minimize interface traffic? It’s become a compiler challenge, a MAC efficiency challenge. All the early attempts to build a system out of analog compute or in-memory compute, and all the other esoteric executions, have kind of fallen by the wayside. People now realize if I’m shuffling 100 billion bytes of data back and forth, over and over again, that’s the problem I need to go solve. It’s not, “Do I do my 8 x 8 multiply with some sort of weird anticipation logic that burns no power?” At the end of the day, that’s a fraction of the overall problem.
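
The compiler-side problem Roddy describes can be caricatured with a simple tiling model: for a single matrix multiply, the amount of DRAM traffic depends on how large an output tile the on-chip SRAM can hold. The model below is deliberately crude (one level of output tiling, int8 data, no overlap of compute and transfer) and the matrix sizes are arbitrary, but it shows the trend a real graph compiler exploits.

```python
# Crude sketch of DRAM traffic for a tiled matrix multiply C = A @ B as the
# SRAM tile buffer grows. A real graph compiler weighs far more than this.

def dram_traffic_mb(M, K, N, tile_m, tile_n, bytes_per_elem=1):
    """Each output tile re-reads a panel of A and a panel of B from DRAM."""
    tiles_m = -(-M // tile_m)          # ceiling division
    tiles_n = -(-N // tile_n)
    reads_a = tiles_n * M * K          # A is re-read once per column of output tiles
    reads_b = tiles_m * K * N          # B is re-read once per row of output tiles
    writes_c = M * N
    return (reads_a + reads_b + writes_c) * bytes_per_elem / 1e6

M = K = N = 4096
for tile in (64, 256, 1024):
    # SRAM to hold one A panel, one B panel, and one C tile (int8 elements).
    sram_kb = (tile * K + K * tile + tile * tile) / 1e3
    print(f"tile {tile}x{tile}: ~{sram_kb:.0f} KB SRAM, "
          f"~{dram_traffic_mb(M, K, N, tile, tile):.0f} MB DRAM traffic")
```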

Chan Carusone: If SRAM density becomes an issue and limits die size, that may drive different tradeoffs as to where the memory should reside. The availability of new technology tools, like CXL, may percolate up and impact the way the software is being architected and conceived, and the algorithms that might be most efficient for particular applications. That interplay is going to become more interesting because these models are so enormous that proper decisions like that can make a huge difference in total power consumption or cost for implementation of a model.

SE: How does SRAM help balance low power and high performance in AI and other systems?

Chan Carusone: The simple answer is that having embedded SRAM allows for quick data retrieval and less latency required to get the computations going. It reduces the need to go off chip, which is generally more power hungry. Every one of those off-chip transactions costs more. It’s the trade-off between that and filling your chip with SRAM and not having any room left to do logic.
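
To put rough numbers on “every off-chip transaction costs more,” the sketch below compares the energy of one full pass over a model’s weights served from on-chip SRAM versus off-chip DRAM. The per-byte energy figures are coarse assumptions for illustration only, not measurements; real values vary widely with process node, interface, and access pattern.

```python
# Rough energy comparison of serving data from on-chip SRAM vs. off-chip DRAM.
# Per-byte energies below are assumed round numbers, not characterized values.

PJ_PER_BYTE = {"on-chip SRAM": 1.0, "off-chip DRAM": 30.0}

bytes_moved = 7e9   # e.g. streaming the weights of a hypothetical 7B int8 model once

for source, pj in PJ_PER_BYTE.items():
    joules = bytes_moved * pj * 1e-12
    print(f"{source}: ~{joules:.2f} J per full pass over the weights")
```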

Roddy: The scaling difference between logic and SRAM as you move down the technology curve interplays with that other question about management, power, and manufacturability. For example, there are a lot of architectures for AI inference or training that rely on arrays of processing elements. You see a lot of dataflow-type architectures, a lot of arrays of matrix calculation engines. Our architecture at Quadric has a two-dimensional matrix of processing elements, where we chunk 8 MACs, some ALUs, and memory, and then tile and scale that out — not too dissimilar from what people do in GPUs with numerous shader engines or various other dataflow architectures. When we did the first implementation of our architecture, we did a proof-of-concept chip in 16nm. Our choices about how much memory to put next to each of those compute elements were rather straightforward. We have a 4K-byte SRAM next to every one of these little engines of MACs and ALUs, with that same block of logic, organized as 512 by 32 bits. When you scale down, suddenly you look at 4nm, and you think, let’s just build that with flops, because the overhead of having all the SRAM structure didn’t scale as much as the logic did. At 4nm, does the processor designer need to think, “Do I change the amount of resources in my overall system at that local compute engine level? Should I increase the size of the memory in order to make it a useful size for an SRAM? Or do I need to convert from SRAM over to traditional flop-based designs?” But that changes the equation in terms of testability and FIT rates, if you’re talking about an automotive solution. So a lot of things are at play here, which is all part of this hierarchy of capability.

The entire picture the solution architect needs to understand demands a lot of skills, spanning process technology, efficiency, memory, and compilers. It’s a non-trivial world, which is why there is so much investment pouring into this segment. We all want these chatbots to do marvelous things, but it’s not immediately obvious what is the right way to go about it. It’s not a mature industry, where you’re making incremental designs year after year. These are systems that change radically over two or three years. That’s what makes it exciting — and also dangerous.

Chan Carusone: TSMC’s much-publicized FinFlex technology can provide another avenue for trading off power versus performance, and leakage versus area. Another indication is that people are talking about 8T cells now instead of 6T cells. Everyone’s pushing these designs around, exploring different parts of the design space for different applications. All that R&D investment is illustrative of the importance of this.

Yun: Using a flip-flop for memory is a great idea. We can read and write faster, because a register file flips much faster than the L1 cache memory. If we use that, it would be the ultimate solution to improve performance. And in my experience, a register file is much more robust against transient defects than SRAM because of its stronger pull-down and pull-up performance. It’s a very nice solution if we have a huge number of cores with tiny memories, and those in-core memories are built from register files. My only concern is that a register file uses bigger transistors than SRAM, so the standby leakage and the dynamic power are much higher than those of SRAM. Would there be a solution for that additional power consumption when we use a register file?

Roddy: Then you get into this question of partitioning register files, clock gating, and power-downs. It’s a compiler challenge, the offline, ahead-of-time compile, so you’ll know how much of a register file or memory is being utilized at any given point in time. If you build it in banks and you can turn them off, you can mitigate those kinds of problems, because for certain portions of a graph that you’re running in machine learning you don’t need all of the memory, while for other portions you do, so you power things up and down. We’re getting into a lot of sophisticated analysis of the shapes, sizes, and locality of the tensors. The movement of the tensors becomes a large ahead-of-time graph compilation problem, and not so much an optimization of the 8 x 8 multiplication or floating-point multiplication. That’s still important, but there’s a layer above it that is a higher leverage point. You get more leverage by optimizing the sequencing of operations than you do by optimizing energy efficiency and delay once everything is already scheduled.
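
A minimal sketch of that bank-gating idea, assuming a hypothetical banked SRAM and per-segment tensor footprints known from ahead-of-time graph compilation: banks not needed for the current portion of the graph are powered down.

```python
# Sketch of compile-time SRAM bank gating as described above. Bank size, bank
# count, and per-segment footprints are all hypothetical, illustrative values.

BANK_BYTES = 256 * 1024      # 256 KB per bank (assumed)
NUM_BANKS = 16               # 4 MB of banked on-chip SRAM (assumed)

# Live weight + activation footprint per graph segment, in bytes (illustrative).
segments = {"stem conv": 1.2e6, "mid blocks": 3.8e6, "head / classifier": 0.4e6}

for name, footprint in segments.items():
    banks_on = min(NUM_BANKS, -(-int(footprint) // BANK_BYTES))  # ceiling division
    print(f"{name}: {footprint / 1e6:.1f} MB live -> "
          f"{banks_on} of {NUM_BANKS} banks powered, {NUM_BANKS - banks_on} gated off")
```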

Further Reading
The Uncertain Future Of In-Memory Compute (Part 2 of this conversation)
The answer may depend on whether SRAM can shrink further.
DRAM Choices Are Suddenly Much More Complicated
The number of options and tradeoffs is exploding as multiple flavors of DRAM are combined in a single design.


