Difficult Memory Choices In AI Systems

Tradeoffs revolve around power, performance, area, and bandwidth.


The number of memory choices and architectures is exploding, driven by the rapid evolution in AI and machine learning chips being designed for a wide range of very different end markets and systems.

Models for some of these systems can range in size from 10 billion to 100 billion parameters, and they can vary greatly from one chip or application to the next. Neural network training and inferencing are among the most complex workloads today, making it that much more difficult to come up with an optimal memory solution. These systems consume huge amounts of compute resources — principally using multiply-accumulate math — as well as vast amounts of memory bandwidth. And any slowdown in one area can have repercussions across the entire system.

“A litany of techniques are currently deployed to reduce neural network model complexity and neural network model memory requirements,” said Steve Roddy, vice president of product marketing for Arm’s Machine Learning Group. “For example, model size can be minimized through quantization, network simplification, pruning, clustering, and model compression. At runtime in the device, smart scheduling to re-use intermediate values across layers also can reduce memory traffic and thereby speed up inference runtime.”
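As a rough illustration of why quantization matters, the sketch below estimates weight storage for models at the 10 billion and 100 billion parameter sizes mentioned above. It counts weights only (no activations, optimizer state, or compression), and the function name is hypothetical. The bytes-per-parameter values are simply the standard widths of each numeric format.

```python
# Standard storage width, in bytes, of one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_footprint_gb(num_params: int, dtype: str) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored densely at the given width.
    Pruning, clustering, and compression would shrink this further.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Model sizes taken from the range quoted in the article.
    for params in (10_000_000_000, 100_000_000_000):
        for dtype in BYTES_PER_PARAM:
            size = weight_footprint_gb(params, dtype)
            print(f"{params / 1e9:.0f}B parameters as {dtype}: {size:.0f} GB")
```

Under these assumptions, moving from 32-bit floating point to 8-bit integers cuts the weight footprint, and the memory traffic needed to stream the weights once, by a factor of four.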

This puts tremendous pressure on memory developers to deliver as much bandwidth as possible, with the lowest power, area, and cost. And the trend shows no sign of abating. Neural network models continue to grow year over year, which means data sets also are growing because those models need to be trained.

“The size of the models, and the size of the training sets are growing by about an order of magnitude every year,” said Steven Woo, fellow and distinguished inventor at Rambus. “Early this year, when the newest kind of natural language processing model came out, it had about 17 billion parameters. That’s just enormous. Then, in the summer this year, a newer version came out that was 175 billion parameters. The number of parameters went up 10X in about seven months.”

Neural network models in the 1980s and early 1990s had roughly 100 to 1,000 parameters. “If I have a larger model, I need more examples to train it because every one of those parameters has to get tuned,” Woo said. “Being the impatient kind of people that we are in technology, when you have more data, you don’t want to wait any longer to train anything. The only way out of the box is to have more bandwidth. You’ve got to be able to push that data more quickly into the system and pull it out more quickly. Bandwidth is the number one concern.”

Another concern is energy. “If all you had to do was to double the performance, and could just double the amount of power you consumed, life would be great,” Woo said. “But that’s not how it works. You have a power envelope you care about. Your wall sockets can handle only so much power. It turns out people really want the performance improved by a factor of X, but they want it more power-efficient by a factor of 2X, and that’s where it gets hard.”

It’s even harder on the inferencing side. “Today, there is a divide growing between the training and the inference parts of AI/ML,” noted Marc Greenberg, group director, product marketing, IP Group at Cadence. “Training requires the most memory bandwidth, and is often done on powerful server-type machines or very high-end GPU-type cards. In the training space, we’re seeing HBM memory at the high end and GDDR6 at the lower end of training. The HBM memories are particularly good at providing the highest bandwidth at the lowest energy-per-bit. HBM2/2E memories are providing up to 3.2/3.6 terabits per second of memory bandwidth between the AI/ML processor and each stack of memory, and the forthcoming HBM3 standard promises even higher bandwidth.”

Cost tradeoffs
That performance comes at a cost. “As a high-end solution there is a high-end price to match, which means that HBM will likely stay in the server room and other high-end applications for the time being,” Greenberg said. “GDDR6 technology helps to bring the cost down, with today’s devices offering 512Gbit/s at the 16Gbps data rate, with faster data rates on the horizon. It’s also common for users to put multiple of these in parallel. For example some graphics cards use 10 or more GDDR6 parts in parallel and/or higher speeds to achieve 5Tbps bandwidth and beyond.”
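The bandwidth figures Greenberg cites can be reproduced with simple arithmetic: peak bandwidth is the interface width multiplied by the per-pin data rate and the number of devices. The sketch below assumes a 32-bit interface per GDDR6 device and a 1,024-bit interface per HBM2/2E stack, which are the widths defined for those memory types. The HBM pin rates of 3.2 and 3.6 Gbit/s are an assumption inferred from the quoted per-stack totals, and the function name is hypothetical.

```python
GDDR6_WIDTH_BITS = 32    # data interface width of one GDDR6 device
HBM2_WIDTH_BITS = 1024   # data interface width of one HBM2/2E stack


def peak_bandwidth_gbit_s(width_bits: int, pin_rate_gbps: float, devices: int = 1) -> float:
    """Peak (not sustained) bandwidth in Gbit/s: width x per-pin rate x device count."""
    return width_bits * pin_rate_gbps * devices


if __name__ == "__main__":
    one_gddr6 = peak_bandwidth_gbit_s(GDDR6_WIDTH_BITS, 16.0)
    ten_gddr6 = peak_bandwidth_gbit_s(GDDR6_WIDTH_BITS, 16.0, devices=10)
    print(f"One GDDR6 device at 16 Gbps: {one_gddr6:.0f} Gbit/s")
    print(f"Ten GDDR6 devices in parallel: {ten_gddr6 / 1000:.2f} Tbit/s")

    # Assumed pin rates; chosen to match the quoted 3.2/3.6 Tbit/s per stack.
    for pin_rate in (3.2, 3.6):
        stack = peak_bandwidth_gbit_s(HBM2_WIDTH_BITS, pin_rate)
        print(f"One HBM stack at {pin_rate} Gbps per pin: {stack / 1000:.2f} Tbit/s")
```

These are peak numbers. Sustained bandwidth depends on the access pattern, refresh, and controller efficiency, none of which is modeled here.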

Inferencing is still evolving, and that is particularly evident at the edge. “For AI inference we’re mostly seeing the GDDR6 and LPDDR5 memories in new designs,” Greenberg said. “They’re offering possibilities for more moderate bandwidth at more moderate cost, which then allows AI to be deployed at the edge of the cloud and in real time without having to send all the data back to the server.”

Many of the AI machines that are being developed now are very well planned out, using very regular layouts and structures.

“If you think back to the SoC design era, you saw a lot of randomness happening on chips because the chips were very heterogeneous,” he said. “They had a lot of different functions and there were a lot of heterogeneous functions. The chips looked like mixing pots of different blocks. But when you look at an AI chip you’re going to see a very regular structure, because that’s how you’re going to manage a very large amount of data in a very parallel data flow across the chip. It’s a different architecture than what we’ve done in an SoC, and even many CPUs. It’s structured around how to flow data through that chip.”

All of this has a direct bearing on memory choices, particularly DRAM, which was predicted to run out of steam years ago. In fact, the opposite has happened. There are more options available today than ever before, and a price differential for each of them.

“For instance, we are at a transition point on standard DDR from DDR4 to DDR5,” said Vadhiraj Sankaranarayanan, technical marketing manager for DDR products at Synopsys. “Customers who have come to us with DDR4 requirements get enticed because that product lifespan is going to be long enough that they may want DDR5 support also. Similarly with LPDDR5, each of these newer standards, besides giving higher performance, has an advantage in power supply. There’s power reduction because these standards run at a lower voltage. There is also a benefit in RAS (reliability, availability, and serviceability). In terms of the features, because of the higher speeds, the DRAMs themselves are going to be equipped with features that allow correction of single bit errors that could occur anywhere in the subsystem.”

These options are necessary because memory configurations can vary significantly within today’s AI/ML applications. “We have seen design teams use LPDDR in addition to HBM, but it really depends on the bandwidth requirement,” Sankaranarayanan said. “There is also the cost factor that needs to be taken into consideration. With HBM, because there are multiple DRAM die stacked with through-silicon via technology — and because the DRAM and the SoC reside in the same SoC package using an interposer, which requires multiple packaging steps — the cost of each HBM is higher today. But as the usage of HBM increases because of AI-related applications, and with higher demand, those prices should become more reasonable in the near future.”

Power is primary
Not surprisingly, power management is the top consideration in AI/ML applications. That’s true for the data center as well as in edge devices.

In an SoC, the power allocated to memory can be broken into three components.

“First is the power it takes to retrieve the bits out of the DRAM core, and there’s really nothing you can do about that,” said Rambus’ Woo. “You have to get them out of the core to do something with them. Second, there’s the power to move the data, i.e., the power associated with the circuits on either end of the wire. Third, there’s the SoC PHY and the interface in the DRAM, as well. It turns out when the power is put in these buckets, it’s roughly a third in each of those buckets. Two thirds of the power is spent just moving the data back and forth between these two chips, which is a little scary because it means getting data out of the DRAM core — the thing you have to do — is not what’s dominating the power. In trying to become more power-efficient, one of the things you realize is maybe if you thought about stacking these things on top of each other you could get rid of a lot of this power, and that’s what happens in an HBM device. If you were to think about stacking the SoC with the DRAM, the power of the communication itself could drop by many times, maybe even an order of magnitude. That’s how much power you could potentially save.”

Fig. 1: HBM2 memory system power, based on PHY plus DRAM power at 2Gbps, streaming workload, with power breakdown of 100% reads and 100% writes. Source: Rambus

There is no free lunch here. “If you were to do that, you now would be limited more by the DRAM core power, and you’d have to think about how to reduce the DRAM core power to make the overall pie really small,” Woo said.
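Woo's rough breakdown lends itself to a back-of-the-envelope calculation. The sketch below takes the three roughly equal buckets he describes, applies a best-case tenfold reduction to the two-thirds spent moving data, and reports how much power remains and what share the DRAM core then represents. The equal split and the order-of-magnitude reduction are taken from his remarks. Applying that reduction uniformly to both data-movement buckets is a simplifying assumption, and the function name is hypothetical.

```python
def stacked_power_estimate(core: float = 1 / 3,
                           data_movement: float = 2 / 3,
                           reduction_factor: float = 10.0) -> tuple[float, float]:
    """Estimate memory-system power after stacking, relative to a baseline of 1.0.

    core             -- fraction of baseline power spent in the DRAM core (unchanged)
    data_movement    -- fraction spent moving data between the two chips
    reduction_factor -- assumed best-case cut in data-movement power from stacking

    Returns (remaining power as a fraction of baseline, DRAM core share of it).
    """
    remaining = core + data_movement / reduction_factor
    return remaining, core / remaining


if __name__ == "__main__":
    remaining, core_share = stacked_power_estimate()
    print(f"Remaining power: {remaining:.0%} of baseline")
    print(f"DRAM core share of remaining power: {core_share:.0%}")
```

Under these assumptions the total drops to about 40% of the baseline, and the DRAM core goes from one third of the total to more than 80% of what is left, which is why Woo points to core power as the next limit.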

This is an area of ongoing research, but the solution isn’t obvious. As more of these bits are put onto a chip, they become smaller, and therefore don’t hold as many electrons. As a result, it’s harder to sense whether there’s a one or a zero there, and they don’t necessarily retain those electrons as long as might be desired, so they need to be refreshed more often.

New materials and new cell designs may help. Another option could be managing the power for the PHY, but there’s a circular dependency of everything on everything else, so the challenge on the PHY is really difficult.

“As you go faster, there is a tug of war going on between doing more things to ensure the correct transmission of the data,” Woo said. “It’s similar to an auctioneer. When they talk, they talk loudly. In the PHY, it’s the same phenomenon. In order to continue having distinguishable symbols, you have to have an appropriate amplitude, so one of the challenges is how to have the right amplitude and avoid slurring things to ensure that what the person is receiving is really what you’re saying. A lot of work goes into clearly enunciating the symbols that are going back and forth on the wire. There are other techniques to try dropping the amplitude, but they’re all tradeoffs. In general, people don’t want to change their infrastructure. All things being equal, if they could pick something that was more incremental over something that was more revolutionary, they would pick the incremental thing. That’s the challenge.”

On-die memory versus off-die memory
Another significant tradeoff in AI/ML SoC architectures today is where to place the memory. While AI chips often have on-die memory, off-die memory is essential in AI training.

“The problem is basically how much data needs to be stored for the neural network that you’re trying to do,” said Cadence’s Greenberg. “For each neuron, there’s a certain amount of storage that’s needed. Everybody would love to use on-chip memory. Anytime that you can use on-chip memory, you want to use it. It’s super fast, it’s super low power, but it’s expensive. Every millimeter of area that you put on your main die is a millimeter of area that you can’t use for logic and other functions on the chip within a certain budget.”

On-die memory is very expensive because it is manufactured on essentially a logic process. “The logic process I’m manufacturing on may be a 7nm or 5nm process,” he said. “I’m doing it on a process that’s got something like 16 layers of metal, so it’s expensive to put memory on that logic chip. If you can do it on a discrete chip, then the memory process can be optimized for the cost target. It has a very limited amount of metal layers on top, and the cost per square millimeter of that memory die is significantly less than the cost per square millimeter of a 7nm or 5nm logic die.”

Most AI/ML engineering teams struggle with this decision because it’s still relatively early in the lifecycle of these designs. “Everybody starts from the position of wanting to keep all the memory on die,” Greenberg said. “There’s not really a standard that you can look at. In most AI chips, the floorplans are really different. The industry hasn’t decided the best architecture for AI, so we’re still essentially in an experimental stage with the architecture of AI chips, and moving toward what most people will probably settle around. But today it’s still very wide open. We still see a lot of innovation. So then, how do you recommend a memory type for that? It really comes back to some of the key parameters that everybody looks at in memory, which is how much memory do you need? How many gigabytes of data do you need to store? How fast do you want to get to it? How much PCB area do you have? How much do you want to pay? Everybody will solve that equation a little bit differently.”

Those kinds of decisions affect every aspect of AI/ML chips, including specialized accelerators. The big choices there hinge on power, performance and area, with sharp demarcations between cloud and edge.

“These two things are quite varied,” said Suhas Mitra, product marketing director in the Tensilica group at Cadence. “There are similarities, but they’re quite different. If you’re designing a processor for the data center cloud, power and area have a lot of meaning in the sense of how you do memory, memory hierarchy, how you put memory down.”

For edge computing, the tradeoffs continue to grow in complexity, with a fourth variable added to the traditional PPA equation — bandwidth. “The discussion should be about PPAB, because these four axes we constantly have to juggle,” Mitra said. “In a processor design or accelerator design for AI/ML, in figuring out the power, performance, area, bandwidth tradeoff, a lot depends upon the nature of the workload. When you’re talking about something at the edge, fundamentally, you need to be very efficient when it comes to how much performance is achieved in the area that is put down. How much performance do I get compared to the wattage I consume? We always want to talk about those metrics.”

He noted that’s why people spend so much time looking at in-memory interfaces. For the processor/accelerator designer, these considerations take a different shape. “The shape and form essentially is with this AI workload. How do I make sure that I do compute in a very efficient manner because I have very little area to play with? You pigeonhole because you can’t sacrifice too much area, too much wattage, too much on compute. Where is the sweet spot for you to do all these things? You look at different workloads and try to figure out what the compute should be, what the frames per second should be, what the frames per second per watt should be, what the frames per second per millimeter square should be.”
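The metrics Mitra lists are simple ratios, and a small sketch shows how two candidate designs might be compared on them. Everything here is hypothetical: the class name, the design names, and all of the numbers are invented for illustration and do not describe any real accelerator. The bandwidth ratio is an added assumption that stands in for the fourth axis in PPAB.

```python
from dataclasses import dataclass


@dataclass
class AcceleratorCandidate:
    """One hypothetical design point for a fixed workload."""
    name: str
    fps: float             # frames per second on the target workload
    power_w: float         # power consumed, in watts
    area_mm2: float        # silicon area, in square millimeters
    bandwidth_gb_s: float  # memory bandwidth consumed, in GB/s

    def fps_per_watt(self) -> float:
        return self.fps / self.power_w

    def fps_per_mm2(self) -> float:
        return self.fps / self.area_mm2

    def fps_per_gb_s(self) -> float:
        return self.fps / self.bandwidth_gb_s


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    candidates = [
        AcceleratorCandidate("design_a", fps=120.0, power_w=2.0, area_mm2=4.0, bandwidth_gb_s=8.0),
        AcceleratorCandidate("design_b", fps=200.0, power_w=5.0, area_mm2=5.0, bandwidth_gb_s=25.0),
    ]
    for c in candidates:
        print(f"{c.name}: {c.fps_per_watt():.1f} fps/W, "
              f"{c.fps_per_mm2():.1f} fps/mm2, {c.fps_per_gb_s():.1f} fps per GB/s")
```

In this made-up comparison the second design is faster in absolute terms and per unit area, but worse per watt and per unit of bandwidth, which is the kind of conflict between axes that makes the choice workload-dependent.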

AI architectures are in a state of rapid evolution. When and whether that will stabilize is anyone’s guess, and that makes it much harder to discern if choices are correct, and for how long.

“Are you on the right path? This question is correct, but there are many different answers,” Mitra said. “In traditional processor design, if I designed it this way, then this looks like this. So all the people went around and designed processor IPs, and there were some variants that people did such as VLIW versus superscalar, etc. But it will never be the case where one design will win out. You will find many different designs that will win out. It’s almost like saying you have 100 options given to you, 40 options preferably, but there will not be one solution. That’s the reason why, if we look at a Google TPU, it’s designed a certain way. You can come to some other accelerator and it is designed another way. Going forward, you will see many more of these architectural choices people will make because AI has many different meanings for different verticals.”
