Capacity, speed, power and cost are becoming critical factors in memory for AI/ML applications.
Designing memories for high-performance applications is becoming far more complex at 7/5nm. There are more factors to consider, more bottlenecks to contend with, and more tradeoffs required to solve them.
One of the biggest challenges is the sheer volume of data that needs to be processed for AI, machine learning or deep learning, or even in classic data center server racks.
“The designs get bigger, but the parasitics also get larger,” said Joy Han, senior principal product manager at Cadence. “The number of PVT (process, voltage and temperature) corners that are required to accurately characterize memory has gone up significantly.”
All of these issues speak to the heart of how system architects can squeeze more out of the technology, whether it is 7nm or 5nm, and turn it into real compute power.
“There are bottlenecks in today’s architectures, mainly how you get all of the data between the memory and the compute,” said Magdy Abadir, vice president of marketing at Helic. “To solve this, there are many approaches. Some of them bring the memory in as close as possible to the CPU, which means that the data buses have to be very fast, and that solves a bit of the problem by not going off-chip all the time. However, you can’t pack all the memory you need inside the chip next to the CPU, so you still have to have a next-level hierarchy of the memory that’s sitting as close to the chip as possible. The capacities of these things are getting extremely high and are using esoteric technologies like stacked memories with TSVs.”
In many high-performance applications, the challenges from a system design standpoint revolve around how to get more bandwidth to fit in a reasonable area on the chip with a reasonable power profile. For example, high-bandwidth memory (HBM2) is very efficient from a power and area standpoint because it uses 3D stacking technology, explained Frank Ferro, senior director of product management at Rambus.
But the tradeoff here is cost. HBM2 is more expensive, and so far the primary applications of this technology are tied to some form of advanced packaging, either 2.5D or high-end fan-outs, which have been developed with high performance in mind rather than cost. The alternative is some combination of DDRs and/or GDDRs, which can be combined to achieve more performance than a traditional DRAM solution, but they require a larger area and more chips.
Fig. 1: Samsung’s HBM2. Source: Samsung
“Do I use 5 or 10 DDRs or GDDRs or 1 HBM stack?” Ferro said. “What are the system tradeoffs, power tradeoffs and performance tradeoffs? It’s interesting because SerDes has been driving up to 56 or 112 Gbps, so there are all of these very high-speed links in the system and now you’re moving data very rapidly, but now you have to start to store it and process it very rapidly, too. As a result, we continue to see in the networking market and in the enterprise market [engineering teams asking] how to get more memory bandwidth to go with all this moving of data around.”
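Ferro’s question reduces to bandwidth arithmetic. As a rough sketch, the number of devices needed to hit a given system bandwidth can be compared directly; the per-device figures below are assumptions based on common speed grades, not numbers from any vendor’s datasheet.

```python
# Rough device-count comparison for a target system bandwidth.
# Per-device figures are assumptions based on common speed grades,
# not numbers from any vendor datasheet.
import math

DEVICES = {
    "DDR4-3200 DIMM (64-bit)": 25.6,               # 3.2 Gb/s/pin * 64 / 8
    "GDDR6 device (32-bit, 16 Gb/s/pin)": 64.0,    # 16 * 32 / 8
    "HBM2 stack (1024-bit, 2.4 Gb/s/pin)": 307.2,  # 2.4 * 1024 / 8
}

TARGET_BW = 300.0  # GB/s, e.g., a networking or AI accelerator budget

for name, bw in DEVICES.items():
    count = math.ceil(TARGET_BW / bw)
    print(f"{name}: {bw:6.1f} GB/s each -> {count} device(s) needed")
```

The device counts alone say nothing about cost, power or board area, which is exactly where the real tradeoffs appear.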
Helic’s Abadir agrees. “In order to get the capacity and the speed, you have to go to these extremes and you try to integrate as much as possible in a small area to achieve the speed, the access rates and be able to pass a lot of data very quickly,” he said, noting the problem gets even trickier with machine learning/AI applications because they are so data-intensive. “You’re doing a lot of things on a lot of data. This is the common theme on these types of solutions. Technologically, you look at the trend of not slowing down in terms of how much memory we can pack in a given SoC, but there are design challenges to make these things work. Even with this new technology that has opened the door to new opportunities, you’re not changing your basic design concerns or the design flow that has been used for older generations, especially around something as basic as inductance. We do know that as the frequency gets higher, as these data rates get higher, there is going to be interference and everybody knows it. Have you changed your design flow? The answer [from users] is, ‘Not really.’”
Tradeoffs
While HBM2 and GDDR are competing for the same application space today, HBM2 has a lot of runway left. JEDEC has plans for at least two more generations of HBM, and at least one more half-step with HBM3+.
“You could ask which memory gives me better performance today, and is cheaper,” said Graham Allan, product marketing manager for memory interfaces at Synopsys. “GDDR is probably cheaper today. But you can get the same performance from one HBM stack as about five GDDR memories. And if you look at these high-end GPUs, they often have 12 or 14 32-bit GDDR memories in a giant horseshoe around them on the PCB. So they take a lot of physical area on the board, whereas the HBM does not take a lot of area and is much more efficient from a power perspective. And that’s the key metric that all of these GPUs care about.”
Space becomes one of the key tradeoffs, and that has to be balanced against cost and the form factor of the end device. “Memory occupies the majority of the space on a die, so if you address the memory, you’ve addressed a major part of your design already,” said Cadence’s Han. “What we have seen customers do with static RAM (SRAM), as an example, is go to memory compilers to get the SRAM. We’ve also seen that they might do customization on the memory compiler. But we’ve also seen some cases that are really aggressive where they design their own memory. So it’s not really coming out of a compiler. They design it from scratch. We have seen multiple variations on the types of memories that they end up getting, which could be coming straight from IP providers, or it could be a combination of IP providers plus some customization work, or it could be entirely customized so that it will be a differentiator for their design.”
Customization can happen on a number of fronts. “For example, the kind of customization that some users want to see is that with the memory compilers that are being provided, they might want to add in additional components,” she said. “Or they may look at PVT corners that have been characterized, and they come back and say they want more coverage. They want to know how the memory will behave in their areas of interest, which may not correlate with what the memory compiler from the IP provider is providing.”
That customization also can require low power, Han said. “[Engineering teams] want to make sure that the product is going to last a long time, with long battery life, so the supply voltage cannot be too high.”
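To see why corner counts grow so quickly, consider the cross product of process, voltage and temperature points. The sketch below uses hypothetical corner names and values, not any vendor’s characterization flow, to show how a customer’s request for extra voltage coverage multiplies the work rather than adding to it.

```python
# Minimal illustration of PVT corner explosion in memory
# characterization. Corner names and values are hypothetical.
from itertools import product

process = ["ss", "tt", "ff", "sf", "fs"]    # process corners
voltage = [0.675, 0.75, 0.825]              # supply points (V)
temperature = [-40, 25, 85, 125]            # junction temps (C)

corners = list(product(process, voltage, temperature))
print(f"{len(corners)} corners to characterize")  # 5 * 3 * 4 = 60

# A customer asking for extra low-voltage coverage adds points
# multiplicatively, not additively:
voltage_extended = voltage + [0.60, 0.64]
print(f"{len(process) * len(voltage_extended) * len(temperature)} "
      "corners after extending voltage coverage")  # 5 * 5 * 4 = 100
```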
What’s changed
For years, high-performance memory was tied to the classic data center server model, and the data center remains its traditional application.
“If we look farther back in the DDR history, the memories typically were used first in desktops/clients, even before laptops became dominant,” said Synopsys’ Allan. “Starting in about the DDR4 generation, all of the effort involved in defining these new standards for DRAMs has focused on the server environment. Once that is done and the products are introduced to market and they start selling into that application, they waterfall down into the other applications, which are laptops and desktops for DDR. That’s why you would see servers using the more advanced DDR memories earlier than you would be able to go and buy them in a Dell laptop, for example.”
The primary challenge that servers face is capacity. “Servers need an incredible amount of DRAM,” said Allan. “You try to cram as many DRAM bits as you can onto a memory module, and then you want to try to create an environment where you can have as many memory modules as possible in the box. You are trying to get so many gigabytes of DRAM in that box while still trying to not blow all the fuses in the power supply and not break the bank in terms of cost. Traditionally, you would lay down a bunch of components on a DIMM and you would plug that DIMM into a socket, and there would be multiple sockets on the channel. As the speeds have increased over time, every time you take a step you run into a brick wall. So the problem is when we’re trying to fan out to talk to that many components, things break down. We’re loading the bus too heavily. The bus can’t go very fast.”
This is why registered DIMMs and load-reduced DIMMs were created, inserting buffers and registers between the host and the large number of DRAMs, which worked to some degree. More recently, DRAM vendors have begun stacking memory chips inside an HBM package, using either 4- or 8-chip stacks connected with TSVs.
“This is the way of the future,” said Allan. “The components are still quite expensive because there are challenges with ramping up that technology to high volume manufacturing, but the 3D stacked DRAM has the potential to make the load-reduced DIMM a thing of the past.”
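The loading problem Allan describes can be approximated with a first-order RC model: each DIMM adds capacitance to the shared channel, and the achievable per-pin rate falls accordingly. The values in this sketch are invented purely to show the shape of the tradeoff, not real channel electrical parameters.

```python
# First-order illustration of why a multi-drop DRAM bus slows down
# as more DIMMs load the channel. All values are invented to show
# the trend, not taken from real channel electrical specs.
R_DRIVER = 20.0       # ohms, assumed driver impedance
C_TRACE = 2e-12       # farads, assumed baseline trace capacitance
C_PER_DIMM = 3e-12    # farads, assumed added load per DIMM

for dimms in range(1, 5):
    c_total = C_TRACE + dimms * C_PER_DIMM
    tau = R_DRIVER * c_total               # RC time constant
    # crude limit: allow ~2 time constants of settling per bit
    max_rate_gbps = 1.0 / (2 * tau) / 1e9
    print(f"{dimms} DIMM(s): {c_total * 1e12:.0f} pF load, "
          f"~{max_rate_gbps:.1f} Gb/s per pin ceiling")
```

Point-to-point interfaces such as GDDR avoid this multiplicative loading entirely, which is what lets them run so much faster per pin.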
Other factors
In other markets, particularly automotive and AI, the primary memory type is some variant of GDDR.
“If you look at Nvidia’s stock chart over the last couple of years, there are reasons why it’s taken off so much,” said Allan. “High-performance computing for these particular applications has started to almost completely take advantage of this highly parallel processing made available by what used to be graphics processors. The problem happens to be the same kind of problem that a graphics processor addresses, and what’s getting pulled along in that wake is the DRAM that connects to that graphics processor. It has opened up an entirely new market for what used to be a relatively niche area of gaming based on high-end graphics processors from companies like AMD and Nvidia.”
These GPUs traditionally started off talking to DDR3, and to some degree DDR4. Then specialized graphics memories came along for the very high end of the market, namely GDDR, which is now up to GDDR6.
“These are highly specialized memories that go incredibly fast,” said Allan. “For example, GDDR6 is approximately five or six times faster than the fastest DDR4. That’s because they are all point-to-point, so there’s no multiple loading, there are no other components in the way. It’s a very simple schematic, but a very tightly controlled implementation. The signal integrity environment is going to be the long pole in the tent that is going to limit the performance, so the package has to be very carefully designed, the PCB has to be very carefully designed. Cost is added into that equation because there must be more layers on the PCB, more layers in the package, more on-die decoupling on the SoC to make these interfaces work. This is a wide parallel interface with no embedded clock running at 18 gigabits per second. It’s probably hitting the end of the road at about that data rate. Knowing that has been coming for a long time is what gave birth to the HBM option. There’s nothing more power efficient than HBM because of the incredibly short connectivity, and because it’s not a terminated interface. It’s presently at 2.4 gigabits/second per pin. It just uses a heck of a lot of pins. A 32-bit GDDR memory does the opposite. It doesn’t use a lot of pins. It just cranks the clock up really fast.”
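Allan’s contrast boils down to bandwidth = pins × per-pin rate. Using the figures from his quote, a quick calculation shows the two memories reaching similar aggregate bandwidth by opposite routes:

```python
# Aggregate bandwidth = data pins * per-pin rate / 8 (GB/s).
# Per-pin rates are taken from the quote above; other speed
# grades exist for both memory types.
def bandwidth_gbs(pins: int, gbps_per_pin: float) -> float:
    """Interface bandwidth in GB/s."""
    return pins * gbps_per_pin / 8

hbm2 = bandwidth_gbs(pins=1024, gbps_per_pin=2.4)   # wide and slow
gddr6 = bandwidth_gbs(pins=32, gbps_per_pin=18.0)   # narrow and fast

print(f"HBM2 stack:   1024 pins x 2.4 Gb/s = {hbm2:.0f} GB/s")
print(f"GDDR6 device:   32 pins x  18 Gb/s = {gddr6:.0f} GB/s")
print(f"GDDR6 devices per HBM2 stack: {hbm2 / gddr6:.1f}")
```

The resulting ratio of roughly four to five devices per stack is consistent with Allan’s earlier observation about one HBM stack matching about five GDDR memories.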
Manoj Roge, vice president of strategic planning at Achronix Semiconductor, agreed that memory often becomes a big bottleneck, not just from a bandwidth point of view but also as the place where a lot of power is wasted. “Architects must pay really careful attention to the data transfers, and what is really important is understanding the data flow and optimizing the data transfers.”
To be sure, AI workloads are seeing the most growth, with the Googles and Microsofts of the world running a lot of internal workloads that use AI for mail filters and photo classification, he noted. “From that perspective, at the highest level the hyperscalers really care about opex costs more than just capex, because power is a big component. What you will hear talking to different systems engineers is that memory transfer becomes a big component of the power. Everybody is looking at the best way of optimizing the memory transfers.”
Roge noted that memory transfer is the problem because a lot of power is wasted moving data between the compute and the memory. A published paper out of Stanford University quantifies this: if the power used in compute is 1x, transferring data from the compute to memory sitting right next to it, such as a level 1 cache or tightly coupled memory, is in the range of 1 to 10x. “When you go to the level 2 cache or a big block of memory sitting at a corner of a die, that could be another 10 to 20x. Then you go to the external memory like DDR and such, and that would be on the order of 100x power compared to what is wasted in compute.”
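Taking those multipliers at face value, a back-of-the-envelope model shows why data movement rather than arithmetic dominates the energy budget. The access counts below are invented for illustration; only the relative costs come from the figures Roge cites.

```python
# Back-of-the-envelope energy model using the relative costs cited
# above: compute = 1x, L1/tightly coupled = ~10x, on-die L2 = ~20x,
# external DRAM = ~100x. Access counts are invented for illustration.
RELATIVE_COST = {"compute_op": 1, "l1_access": 10,
                 "l2_access": 20, "dram_access": 100}

# Hypothetical workload: one op per operand, modest cache hit rates.
counts = {"compute_op": 1_000_000, "l1_access": 800_000,
          "l2_access": 150_000, "dram_access": 50_000}

total = sum(RELATIVE_COST[k] * counts[k] for k in counts)
for k in counts:
    share = RELATIVE_COST[k] * counts[k] / total
    print(f"{k:12s}: {share:5.1%} of energy")
```

Even with these generous cache hit rates, the arithmetic itself accounts for only a few percent of the total energy; everything else goes to moving data.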
In the context of hardware/software co-design, the most efficient systems come from looking carefully at the data flow and optimizing the memory hierarchy around it, he stressed. “This is because you want to minimize the power or energy wasted in the data transfer. What that means is you need to really carefully think about memory hierarchy. How much do you put on chip? A class of products like Nvidia’s GPUs or Intel’s CPUs is built in a logic process with many levels of metal, so you don’t want to put a lot of memory on die because it will be more expensive than off-chip memory. You won’t get the cheapest memory if you put it on die, but you get the most efficiency. Then you go external, whether it is a DDR class of memory or a GDDR class of memory, to get high bandwidth. Then one has to look at the tiers of memory. Are you solving the memory bandwidth problem? There are two options: HBM and GDDR. Are you solving the capacity problem? For this, you need to consider DDR4 or the emerging DDR5, and so on.”
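Roge’s bandwidth-versus-capacity framing amounts to a first-cut selection rule. The sketch below captures only the shape of that reasoning; the thresholds are assumptions for illustration, not sizing guidance.

```python
# First-cut memory selection following the bandwidth-vs-capacity
# framing above. Thresholds are assumptions for illustration only.
def suggest_memory(bandwidth_gbs: float, capacity_gb: float) -> str:
    """Suggest an off-chip memory class for rough requirements."""
    if bandwidth_gbs > 500:
        return "HBM2 (2.5D package, highest bandwidth, highest cost)"
    if bandwidth_gbs > 100:
        return "GDDR6 (point-to-point, fast pins, more board effort)"
    if capacity_gb > 64:
        return "DDR4/DDR5 DIMMs (cheapest capacity, modest bandwidth)"
    return "DDR4 (commodity choice)"

print(suggest_memory(bandwidth_gbs=800, capacity_gb=16))   # AI accelerator
print(suggest_memory(bandwidth_gbs=40, capacity_gb=512))   # server workload
```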
All of this circles back to a discussion about tradeoffs, and those tradeoffs can include a variety of factors ranging from cost to performance to time to market.
“What you do not want to do is re-spin the silicon,” said Cadence’s Han. “You really want to get silicon success on your first try. You’re looking for a way, before you reach silicon, to get really good accuracy correlations with silicon during the design and verification process. Moore’s Law means the designs get so big, but time to market is also very aggressive, so you really want to increase productivity. Users run a lot of simulations and they are asking for better performance and better accuracy, because instead of having a simulation run finish in five days, they want it to finish in a day.”
Related Stories
How To Choose The Right Memory
Different types and approaches can have a big impact on cost, power, bandwidth and latency.
A New Memory Contender?
FeFETs are a promising next-gen memory based on well-understood materials.
The Future Of Memory
Experts at the table, part 3: Security, process variation, shortage of other IP at advanced nodes, and too many foundry processes.
Will China Succeed In Memory?
The country is banking on DRAM and NAND to reduce its trade deficit.
How AI Impacts Memory Systems
The ways different architectures get around the memory bottleneck.
Memory Market: Will History Repeat Itself?
China can’t support three DRAM companies, but it’s likely that one of them will be successful.