New applications require a deep understanding of the tradeoffs for different types of DRAM.
The number of options for how to build high-performance chips is growing, but the choices for attached memory have barely budged. To achieve maximum performance in automotive, consumer, and hyperscale computing, the choices come down to one or more flavors of DRAM, and the biggest tradeoff is cost versus speed.
DRAM remains an essential component in any of these architectures, despite years of efforts to replace it with a faster, cheaper, or more universal memory, or even to embed it into an SoC. But instead of remaining static, DRAM makers stepped up with a variety of options based upon performance, power, and cost. Those remain the fundamental tradeoffs, and navigating them requires a deep understanding of how the memory will be used, how all the pieces will be connected, and the key attributes of the chip or system in which it will be used.
“We continue to see very aggressive trends in the need for more memory bandwidth, even with the macro-economic situation,” said Frank Ferro, senior director of product management at Rambus. “There are a lot of companies looking at different types of architectures for memory. That includes various ways to solve their bandwidth problems, whether it be processors with lots of on-chip memory, or otherwise. While this approach is going to be the cheapest and fastest, the capacity is pretty low, so the AI algorithm has to be tailored for that type of architecture.”
Chiplets
That still doesn’t reduce the need for attached memory, though. And the move toward heterogeneous computing in general, and chiplets in particular, has only accelerated the need for high-bandwidth memory, whether that is HBM, GDDR6, or LPDDR6.
HBM is the fastest of the three. But so far, HBM has been based on 2.5D architectures, which limits its appeal. “It’s still relatively expensive technology to do the 2.5D interposer,” Ferro said. “The supply chain problems didn’t help things too much. Over the last two years that’s eased a little bit, but it did highlight some of the problems when you’re doing these complex 2.5D systems because you have to combine a lot of components and substrates. If any one of those pieces is not available, that disrupts the whole process or imposes a long lead time.”
Fig. 1: HBM stack for maximum data throughput. Source: Rambus
There has been work underway for some time to connect HBM to some other packaging approach, such as fan-outs, or to stack chips using different kinds of interposers or bridges. Those will become essential as more leading-edge designs include some type of advanced packaging with heterogeneous components that may be developed at different process nodes.
“A lot of that HBM space is really more about manufacturing issues than IP issues,” said Marc Greenberg, group director for product marketing in Cadence’s IP Group. “When you have a system with a silicon interposer inside, you need to figure out how to construct a system with a silicon interposer in it. First, how are you going to have the silicon interposer manufactured there? It’s much larger than regular silicon die. It has to be thinned. It has to be bonded to the various die that are going to be on it. It needs to be packaged. There’s a lot of specialized manufacturing that goes into an HBM solution. That ends up being outside of the realm of IP and more into the realm of what ASIC vendors and OSATs do.”
High bandwidth memory in automotive
One of the areas where HBM is gaining significant interest is in automotive. But there are hurdles to overcome, and there is no timeline yet for how to solve them.
“HBM3 is high-bandwidth, low-power, and it has good density,” said Brett Murdock, director of product marketing at Synopsys. “The only problem is it’s expensive. That’s one downfall to that memory. Another downfall for HBM is that it is not qualified for automotive yet, even though it would be an ideal fit there. In automotive, one of the interesting things that’s happening is that all the electronics are getting centralized. As that centralization happens, basically there’s now a server going in your trunk. There’s so much going on that it can’t necessarily always happen on a single SoC, or a single ASIC. So now the automotive companies are starting to look at chiplets and how they can use chiplets in their designs to get all the compute power they need in that centralized domain. The neat thing there is that one of the potential uses of chiplets is with interposers. And if they’re using interposers now, they’re not solving the interposer problem for HBM. They’re solving the interposer problem for the chiplet, and maybe HBM gets to come along for the ride. Then, maybe, it’s not quite as expensive anymore if they’re already doing chiplet designs for a vehicle.”
HBM is a natural fit there because of the amount of data that needs to move quickly around a vehicle. “If you think about the number of cameras in a car, the data rate of all these cameras and getting all that information processed is astronomical. HBM is the place where all the automotive people would like to go,” Murdock said. “The cost probably isn’t so prohibitive for them as much as it is just getting the technology sorted out, getting the interposer in the car sorted out, and getting the automotive temperatures for the HBM devices sorted out.”
This may take a while, though. In the meantime, GDDR appears to be the rising star. While it has more limited throughput than HBM, it’s still sufficient for many applications, and it’s already automotive-qualified.
“HBM is absolutely going into applications for automotive where cars are talking to something that’s not moving,” said Rambus’ Ferro. “But in the vehicle, GDDR has done a nice job. LPDDR already was in the car, and you can replace a number of LPDDRs with GDDR, get a smaller footprint, and higher bandwidth. Then, as the AI processing is going up, with LPDDR5 and LPDDR6 starting to get up to some pretty respectable speeds [now approaching 8Gbps and 10Gbps, respectively], they’re also going to be a very viable solution in the car. There will still be a smattering of DDR, but LPDDR and GDDR are going to be the favorite technologies for automotive.”
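As a rough illustration of that consolidation, the sketch below works through the peak-bandwidth arithmetic for one GDDR6 device versus LPDDR5 at roughly the speeds cited above. The x32 bus widths and the 16Gbps GDDR6 pin rate are illustrative assumptions for this sketch, not figures from Rambus.

```python
# Rough peak-bandwidth comparison: one GDDR6 device vs. LPDDR5 devices.
# The bus widths and per-pin data rates are illustrative assumptions.

def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) * per-pin data rate (Gb/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8

gddr6 = peak_bandwidth_gb_s(bus_width_bits=32, data_rate_gbps=16.0)   # assumed x32 GDDR6 at 16 Gb/s
lpddr5 = peak_bandwidth_gb_s(bus_width_bits=32, data_rate_gbps=8.0)   # x32 LPDDR5 at ~8 Gb/s, per the article

print(f"GDDR6  x32: {gddr6:5.1f} GB/s")
print(f"LPDDR5 x32: {lpddr5:5.1f} GB/s")
print(f"LPDDR5 devices replaced by one GDDR6: {gddr6 / lpddr5:.1f}")
```

Under those assumptions, a single x32 GDDR6 device delivers roughly the bandwidth of two x32 LPDDR5 devices, which is the kind of footprint consolidation Ferro describes.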
That approach may work well enough for quite some time, according to Cadence’s Greenberg. “A solution that just uses a standard PCB and standard manufacturing technology would seem more sensible than trying to introduce, for example, a silicon interposer into the equation and then qualify it for temperature, vibration, or a 10-year lifetime. Trying to qualify that HBM solution in a vehicle seems to be a much bigger challenge than GDDR6, where you can put the memory on a PCB. If I were in charge of an automotive project at an automotive company, I would only choose HBM as a last resort.”
Edge AI/ML memory needs
GDDR and LPDDR5, and maybe even LPDDR6, are starting to look like viable solutions on some of the edge accelerator cards, as well.
“For PCIe cards doing edge AI inferencing, we’ve seen GDDR out there for a number of years in accelerator cards from companies like NVIDIA,” Ferro said. “Now we’re seeing more companies willing to consider alternatives. For example, Achronix is using GDDR6 in its accelerator cards, and starting to look at how LPDDR could be used, even though the speed is still about half that of GDDR. It’s creeping up, and it gives a little bit more density. So that’s another solution. Those give a nice tradeoff. They provide the performance and the cost benefit, because they still use traditional PCBs. You’re soldering them down on the board next to the die. If you’ve used DDR in the past, you could throw out a lot of DDRs and replace them with one GDDR or maybe two LPDDRs. That’s what we’re seeing a lot of right now as developers try to figure out how to hit the right balance between cost, power, and performance. That’s always a challenge at the edge.”
As always, the tradeoffs are a balance of many factors.
Greenberg noted that in the early stages of the current AI revolution, the first HBM memories were being used. “There was a cost-is-no-object/bandwidth-is-no-object methodology that people were adopting. HBM fit very naturally into that, where somebody wanted to have a poster child for how much bandwidth they could have out of the system. They would construct a chip based on HBM, get their venture capital funding based on their performance metrics for that chip, and nobody was really too worried about how much it all cost. Now what we’re seeing is that maybe you need to have some good metrics, maybe 75% of what you could achieve with HBM, but you want it to cost half as much. How do we do that? The attractiveness of what we’ve been seeing with GDDR is that it enables a lower-cost solution, but with bandwidths definitely approaching the HBM space.”
Murdock also sees the struggle to make the right memory choice. “With high bandwidth requirements, usually they are making that cost tradeoff decision. Do I go to HBM, which typically would be very appropriate for that application were it not for the cost factor? We have customers asking us about HBM, trying to decide between HBM and LPDDR. That’s really the choice they’re making because they need the bandwidth. They can get it in either of those two places. We’ve seen engineering teams putting up to 16 instances of LPDDR interfaces around an SoC to get their bandwidth needs satisfied. When you start talking about that many instances, they say, ‘Oh, wow, HBM really would fit the bill very nicely.’ But it still comes down to cost, because a lot of these companies just don’t want to pay the premium that HBM3 brings with it.”
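The appeal of HBM in those designs is easy to see from the raw numbers. The sketch below compares the aggregate bandwidth of 16 LPDDR interfaces against a single HBM3 stack; the interface widths and per-pin data rates are commonly quoted figures used here purely as assumptions.

```python
# Aggregate bandwidth: 16 LPDDR interfaces vs. one HBM3 stack.
# Interface widths and data rates are assumptions for illustration only.

def bw_gb_s(width_bits: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s."""
    return width_bits * gbps_per_pin / 8

hbm3_stack = bw_gb_s(1024, 6.4)      # one HBM3 stack: assumed 1,024-bit interface at 6.4 Gb/s
lpddr5_if  = bw_gb_s(32, 8.0)        # one x32 LPDDR5 interface at ~8 Gb/s
lpddr5_soc = 16 * lpddr5_if          # 16 instances around the SoC, as Murdock describes

print(f"One HBM3 stack    : {hbm3_stack:6.1f} GB/s")
print(f"16 x LPDDR5 (x32) : {lpddr5_soc:6.1f} GB/s")
```

On those assumptions, 16 LPDDR5 interfaces reach a substantial fraction of one HBM3 stack’s bandwidth, which is why the decision so often comes back to cost rather than raw throughput.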
There are also architecture considerations that come with HBM. “HBM is a multi-channel interface to begin with, so with HBM you have 32 pseudo channels on one HBM stack,” Murdock said. “There are 16 channels, split into 32 pseudo channels, and the pseudo channels are where you’re doing the actual workload on a per-pseudo-channel basis. So if you have all of those pseudo channels there, versus if you’re putting a lot of different instances of an LPDDR onto your SoC, in both cases you have to sort out how your traffic is going to target the overall address space and your overall channel definitions. And in both cases you have a lot of channels, so maybe it’s not too awfully different.”
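In either case the controller has to spread traffic across a large number of channels. The toy mapping below is a hypothetical sketch, not any vendor’s controller logic; the 256-byte interleave granularity and the channel counts are assumptions chosen only to show the shape of the problem.

```python
# Toy address-to-channel interleaving. Whether the targets are 32 HBM3 pseudo
# channels or 16 LPDDR channels, consecutive blocks of the address space are
# typically spread across them. Granularity and channel counts are assumptions.

INTERLEAVE_BYTES = 256  # assumed interleave granularity

def channel_for_address(addr: int, num_channels: int) -> int:
    """Map consecutive 256-byte blocks round-robin onto channels."""
    return (addr // INTERLEAVE_BYTES) % num_channels

for addr in range(0, 8 * INTERLEAVE_BYTES, INTERLEAVE_BYTES):
    hbm_pc   = channel_for_address(addr, num_channels=32)   # HBM3 pseudo channels
    lpddr_ch = channel_for_address(addr, num_channels=16)   # e.g., 16 LPDDR instances
    print(f"addr 0x{addr:06x} -> HBM3 pseudo channel {hbm_pc:2d}, LPDDR channel {lpddr_ch:2d}")
```

Consecutive blocks land on different channels, so the workload’s addressing pattern determines how evenly either memory system is utilized.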
For AI/machine learning developers, LPDDR typically comes in a by-32 (x32) package, which carries two 16-bit channels.
“You have a basic choice to make in your architecture,” he explained. “Do I treat those two 16-bit channels on the memory as truly independent channels from the system viewpoint? Or do I lump them together and make it look like a single 32-bit channel? They always select the two independent 16-bit channels, because that gives them a slightly higher-performance interface. Inside the memory I’ve got two channels, so I have twice as many open pages that I could potentially hit, and I can reduce my overall system latency by having page hits. It makes for a better-performing system to have more, smaller channels, which is what we’ve seen happen with HBM. From HBM2e to HBM3, we dropped the channel and pseudo-channel size very specifically to address that kind of market. We even saw that in DDR5 versus DDR4. We went from a 64-bit channel in DDR4 to a pair of 32-bit channels in DDR5, and everybody likes that smaller channel size to help amp up the overall system performance.”
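Murdock’s open-page argument can be made concrete with a small model. The sketch below is illustrative only; the bank counts, the number of interleaved access streams, and the access pattern are all assumptions. It simply counts how often a request finds its row already open when the same LPDDR package is treated as one 32-bit channel versus two independent 16-bit channels.

```python
# Toy model of the open-page argument: each channel brings its own banks, and
# each bank keeps one row (page) open, so treating a package as two independent
# 16-bit channels doubles the open pages that interleaved access streams can hit.
# Bank counts, stream counts, and the access pattern are all assumptions.
import random

def page_hit_rate(num_channels: int, banks_per_channel: int,
                  num_streams: int = 8, accesses: int = 20_000,
                  trials: int = 20) -> float:
    total = 0.0
    for seed in range(trials):
        rng = random.Random(seed)
        # Each stream repeatedly touches one row in one (channel, bank).
        streams = [(rng.randrange(num_channels),
                    rng.randrange(banks_per_channel),
                    rng.randrange(1 << 14)) for _ in range(num_streams)]
        open_row, hits = {}, 0
        for i in range(accesses):
            ch, bank, row = streams[i % num_streams]   # streams interleave round-robin
            if open_row.get((ch, bank)) == row:
                hits += 1                              # page hit: the row is already open
            else:
                open_row[(ch, bank)] = row             # page miss: activate the row
        total += hits / accesses
    return total / trials

# One 32-bit channel vs. two independent 16-bit channels on the same package:
print("1 x 32-bit channel :", round(page_hit_rate(num_channels=1, banks_per_channel=8), 2))
print("2 x 16-bit channels:", round(page_hit_rate(num_channels=2, banks_per_channel=8), 2))
```

With twice as many banks holding open rows, fewer streams evict each other’s pages, so the two-channel configuration shows the higher hit rate.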
For edge AI inferencing, Greenberg has watched these applications come to the forefront, and has found GDDR6 to be a great fit. “There are a lot of chips that want to have that function. This brings the AI inference close to the edge, so you may be taking in multiple camera inputs or multiple other sensor inputs. Then, using AI right there at the edge, you can get insights into the data you’re processing right there, rather than sending all of it back to a server to do that function.”
Greenberg expects to see a lot of chips coming out fairly soon that will have all kinds of interesting capabilities without having to send a lot of data back to the server, and he expects GDDR6 to play a significant role there.
“The previous generations of GDDR were very much targeted at graphics cards,” he said. “GDDR6 added a lot of features that make it much more suitable as a general-purpose memory. In fact, while we do have users who are using it for graphics cards, the majority are actually using it for edge AI applications. If you need the most bandwidth that you can possibly have, and you don’t care how much it costs, then HBM is a good solution. But if you don’t need quite as much bandwidth as that, or if cost is an issue, then GDDR6 plays favorably in that space. The advantage of GDDR6 is that it can be done on a standard FR4 PCB. There are no special materials required in the manufacturing. There are no special processes, and even the PCB itself doesn’t need to be back-drilled. It doesn’t need to have hidden vias or anything like that.”
Finally, one last trend in the GDDR space involves efforts to make GDDR even more consumer-friendly. “It still has some parts of the specification that favor graphics engines, but as a technology GDDR is evolving in the consumer direction,” he said. “It will continue to evolve in that direction with even wider deployment of GDDR-type technologies.”