Upcoming versions of high-bandwidth memory are thermally challenging, but help may be on the way.
High-bandwidth memory (HBM) is becoming the memory of choice for hyperscalers, but there are still questions about its ultimate fate in the mainstream marketplace. While it’s well-established in data centers, with usage growing due to the demands of AI/ML, wider adoption is inhibited by drawbacks inherent in its basic design. On the one hand, HBM offers a compact 2.5D form factor that enables a tremendous reduction in latency.
“The good thing about HBM is you get all this bandwidth in a very small footprint, and you’ll also get very good power efficiency,” said Frank Ferro, senior director of product marketing at Rambus, in a presentation at this week’s Rambus Design Summit.
On the other hand, it relies on expensive silicon interposers and through-silicon vias (TSVs) to function.
Fig. 1: HBM stack for maximum data throughput. Source: Rambus
“One of the things that plagues high-bandwidth memory right now is cost,” said Marc Greenberg, group director for product marketing in the IP group at Cadence. “3D stacking is expensive. There’s a logic die that sits at the base of the stack of dies, which is an additional piece of silicon you have to pay for. And then there’s a silicon interposer, which goes under everything, including the CPU or GPU, as well as the HBM memories. That has a cost. Then you need a larger package, and so on. There are a number of system costs that take HBM as it exists today out of the consumer domain and put it more firmly in the server room or the data center. By contrast, graphics memories like GDDR6, while they don’t offer as much performance as HBM, do so at significantly less cost. The performance per unit cost on GDDR6 is actually much better than HBM’s, but the maximum bandwidth of a GDDR6 device doesn’t match the maximum bandwidth of an HBM.”
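To put rough numbers on that comparison, here is a back-of-envelope sketch assuming headline figures of about 6.4 Gb/s per pin over a 1,024-bit interface for an HBM3 stack and 16 Gb/s per pin over a 32-bit interface for a GDDR6 device; actual parts vary by vendor and speed grade.

```python
# Rough peak-bandwidth comparison using illustrative spec-sheet numbers only.
def peak_bandwidth_gb_s(pin_rate_gbit_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s = per-pin data rate (Gb/s) * bus width (bits) / 8."""
    return pin_rate_gbit_s * bus_width_bits / 8

hbm3_stack = peak_bandwidth_gb_s(6.4, 1024)   # one HBM3 stack, 1,024-bit interface
gddr6_chip = peak_bandwidth_gb_s(16.0, 32)    # one GDDR6 device, 32-bit interface

print(f"HBM3 stack : ~{hbm3_stack:.0f} GB/s")  # ~819 GB/s
print(f"GDDR6 chip : ~{gddr6_chip:.0f} GB/s")  # ~64 GB/s
print(f"GDDR6 devices to match one stack: ~{hbm3_stack / gddr6_chip:.0f}")
```

A dozen or so GDDR6 devices can approach one stack’s peak bandwidth, but only by spending board area, pins, and energy per bit, which is exactly the tradeoff Greenberg describes.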
These differences provide compelling reasons why companies settle on HBM, even if it may not have been their first choice, said Greenberg. “HBM provides a huge amount of bandwidth, and the energy per bit transferred is extremely low. You use HBM because you have to, because you have no other solution that can give you the bandwidth that you want, or the power profile that you want.”
And HBM is only getting faster. “We expect HBM3 Gen2 to deliver up to a 50% improvement in bandwidth,” said Praveen Vaidyanathan, vice president and general manager of Micron’s Compute Products Group. “From a Micron perspective, we anticipate volume production of our HBM3 Gen2 offering during the course of our fiscal year 2024. In the early part of calendar year 2024, we expect to begin contributing to the anticipated several hundred million dollars in revenue opportunity over time. Additionally, we predict that Micron’s HBM3 will contribute higher margins than DRAM.”
Still, economics may force many design teams to consider alternatives for price-sensitive applications.
“If there’s any other way that you can subdivide your problem into smaller parts, you may find it more cost-effective,” Greenberg noted. “For example, rather than taking a huge problem and saying, ‘I have to execute all of this on one piece of hardware, and I have to have HBM there,’ maybe I can split it into two parts and have two processes running in parallel, perhaps connected to DDR6. Then I could potentially get the same amount of computation done at less cost if I’m able to subdivide that problem into smaller parts. But if you need that huge bandwidth, then HBM is the way to do it, if you can tolerate the cost.”
Thermal challenges
The other major downside is that HBM’s 2.5D structure traps heat, a problem exacerbated by its placement near CPUs and GPUs. In fact, if you tried to construct a theoretical example of poor thermal design, it would be hard to come up with something worse than current layouts, which place HBM’s stacks of heat-sensitive DRAM right next to compute-intensive heat sources.
“The biggest challenge is thermal,” Greenberg said. “You have a CPU, which by definition is generating a huge amount of data. You’re putting terabits per second through this interface. Even if each transaction is a small number of picojoules, you’re doing a billion of them every second, so you have a CPU that’s very hot. And it’s not just moving the data around. It has to compute, as well. On top of that is the semiconductor component that likes heat the least, which is a DRAM. It starts to forget stuff at about 85°C, and is fully absent-minded at about 125°C. These are two very opposite things.”
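That temperature sensitivity shows up directly in refresh requirements. As a rough illustration, the sketch below assumes the common DDR-class convention of halving the average refresh interval (tREFI) in the extended temperature range above 85°C; exact values differ by DRAM generation and vendor.

```python
# Illustrative refresh-overhead estimate vs. temperature (DDR-class conventions assumed).
NOMINAL_TREFI_US = 7.8   # average refresh interval below ~85C (typical DDR4-class value)
TRFC_US = 0.35           # assumed time one refresh command ties up the device (~350 ns)

def refresh_overhead(case_temp_c: float) -> float:
    """Fraction of time spent refreshing instead of serving reads and writes."""
    trefi_us = NOMINAL_TREFI_US if case_temp_c <= 85 else NOMINAL_TREFI_US / 2
    return TRFC_US / trefi_us

for temp_c in (75, 85, 95):
    print(f"{temp_c}C: ~{refresh_overhead(temp_c) * 100:.1f}% of time lost to refresh")
```

The hotter the stack runs, the more of its time the DRAM spends keeping its bits alive rather than delivering them.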
There’s one saving grace. “The advantage of having a 2.5D stack is that there’s some physical separation between the CPU, which is hot, and an HBM sitting right next to it, which likes to be cold,” he said.
In the tradeoff between latency and heat, latency is immutable. “I don’t see anybody sacrificing latency,” said Brett Murdock, product line director for memory interface IP solutions at Synopsys. “I see them pushing their physical team to find a better way to cool, or a better way to place components in order to maintain the lower latency.”
Given that challenge, multi-physics modeling can suggest ways to reduce thermal issues, but there is an associated cost. “That’s where the physics gets really tough,” said Marc Swinnen, product manager at Ansys. “Power is probably the number one limiting factor on what is achievable in integration. Anybody can design a stack of chips and have them all connected, and all that can work perfectly, but you won’t be able to cool it. Getting the heat out is a fundamental limitation on what’s achievable.”
Potential mitigations, which can quickly get expensive, range from microfluidic channels and immersion in non-conductive fluids to more conventional decisions about how many fans or heatsink fins are needed, and whether to use copper or aluminum.
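A first-order way to compare such options is a simple series thermal-resistance model, where junction temperature is ambient temperature plus power times the sum of the thermal resistances along the path. The numbers below are placeholders, not vendor data.

```python
# First-order steady-state thermal model: Tj = Ta + P * (theta_jc + theta_cs + theta_sa).
def junction_temp_c(power_w: float, ambient_c: float,
                    theta_jc: float, theta_cs: float, theta_sa: float) -> float:
    """Estimate junction temperature from a series thermal-resistance path (degC/W)."""
    return ambient_c + power_w * (theta_jc + theta_cs + theta_sa)

# Hypothetical 60 W package: compare an aluminum heatsink with a copper one (lower theta_sa).
for material, theta_sa in (("aluminum", 0.9), ("copper", 0.6)):
    tj = junction_temp_c(power_w=60, ambient_c=45,
                         theta_jc=0.2, theta_cs=0.1, theta_sa=theta_sa)
    print(f"{material:8s} heatsink: Tj ~ {tj:.0f} C")
```

With DRAM starting to misbehave around 85°C, even a few tenths of a degree per watt in the heatsink path can decide whether a design closes thermally.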
There may never be a perfect answer, but models and a clear understanding of desired outcomes can help create a reasonable solution. “You have to define what optimal means to you,” Swinnen said. “Do you want best thermal? Best cost? Best balance between the two? And how are you going to weigh them? The answer relies on models to know what’s actually going on in the physics. It relies on AI to take this welter of complexities and create meta models that capture the essence of this particular optimization problem, as well as explore that vast space very quickly.”
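In practice, “defining what optimal means” often reduces to a weighted objective over normalized metrics. The sketch below is a minimal illustration with made-up candidate designs and weights; it is not a stand-in for the multi-physics and AI-driven exploration Swinnen describes.

```python
# Minimal weighted-sum scalarization over normalized thermal and cost metrics.
candidates = {
    # name: (peak_temp_C, unit_cost_USD), illustrative numbers only
    "air-cooled":      (105, 120),
    "bigger heatsink":  (95, 150),
    "microfluidics":    (70, 400),
}

def score(temp_c, cost_usd, w_thermal=0.6, w_cost=0.4,
          temp_range=(70, 105), cost_range=(120, 400)):
    """Lower is better; each metric is min-max normalized to [0, 1] before weighting."""
    t_norm = (temp_c - temp_range[0]) / (temp_range[1] - temp_range[0])
    c_norm = (cost_usd - cost_range[0]) / (cost_range[1] - cost_range[0])
    return w_thermal * t_norm + w_cost * c_norm

best = min(candidates, key=lambda name: score(*candidates[name]))
print("Best option under this weighting:", best)
```

Change the weights and the winner changes, which is the point: the model only answers the question the design team has already decided to ask.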
HBM and AI
While it’s easy to imagine that compute is the most intensive part of AI/ML, none of this happens without a good memory architecture. Memory is required to store and feed the data behind trillions of calculations. In fact, there’s a point at which adding more CPUs doesn’t increase system performance, because the memory bandwidth isn’t there to support them. This is the infamous “memory wall” bottleneck.
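The memory wall can be made concrete with the well-known roofline model, in which attainable throughput is the lesser of peak compute and memory bandwidth multiplied by arithmetic intensity. The figures below are illustrative and not tied to any specific accelerator.

```python
# Roofline-style estimate: performance is capped by compute or by memory bandwidth.
def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float,
                      flops_per_byte: float) -> float:
    """Return min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

PEAK = 300.0  # hypothetical accelerator peak, TFLOP/s
for bw_tb_s in (0.8, 3.2):                # e.g. one vs. several HBM stacks
    for intensity in (10, 100, 500):      # FLOPs performed per byte moved
        perf = attainable_tflops(PEAK, bw_tb_s, intensity)
        print(f"BW={bw_tb_s} TB/s, {intensity:3d} FLOP/B -> ~{perf:.0f} TFLOP/s")
```

For low-intensity workloads, the bandwidth term wins every time, and adding more compute changes nothing.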
In its broadest definition, machine learning is just curve fitting, according to Steve Roddy, chief marketing officer of Quadric. “With every iteration of a training run, you’re trying to get closer and closer and closer to a best fit of the curve. It’s an X-Y plot, just like in high school geometry. Large language models are basically that same thing, but in 10 billion dimensions, not 2 dimensions.”
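That framing is literal. At toy scale it is the same least-squares fit taught with X-Y plots, just with two parameters instead of billions; a NumPy sketch, purely for illustration:

```python
# Machine learning as curve fitting, in two dimensions: a least-squares line fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=x.size)  # noisy "training data"

slope, intercept = np.polyfit(x, y, deg=1)  # the two fitted parameters (the "model")
print(f"fit: y ~ {slope:.2f} * x + {intercept:.2f}")
# A large language model does the analogous thing with billions of parameters.
```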
Thus, the compute is relatively straightforward, but the memory architecture can be mind-boggling.
“Some of these models have 100 billion bytes of data, and for every iteration for retraining, you have to take 100 billion bytes of data off of the disk, across the backplane of the data center, and into the compute boxes,” Roddy explained. “You’ve got to move this giant set of memory values back and forth literally millions of times over the course of a two-month training run. The limiting factor is moving that data in and out, which is why there’s so much interest in things like HBM or optical interconnects to get from memory to the compute fabric. All of those things are where people are pouring in literally billions of dollars of venture capital, because if you could shorten that distance or that time, you would dramatically simplify and shorten the training process, whether that’s cutting the power or speeding up the training.”
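A quick back-of-envelope calculation shows why that movement dominates, taking the figures Roddy cites (100 billion bytes per pass, millions of passes) and assuming a sustained memory bandwidth of 1 TB/s:

```python
# Back-of-envelope data-movement estimate for a training run (illustrative only).
bytes_per_pass = 100e9       # ~100 billion bytes of model data moved per iteration
passes = 1e6                 # "literally millions of times" over a training run
sustained_bw = 1e12          # assumed sustained memory bandwidth, bytes per second

total_bytes = bytes_per_pass * passes
hours = total_bytes / sustained_bw / 3600
print(f"Total data moved: ~{total_bytes / 1e15:.0f} PB")
print(f"Time at 1 TB/s  : ~{hours:.0f} hours of pure data movement")
```

Roughly 100 petabytes and more than a day of nothing but data motion, before counting any compute, which is why shaving that path is worth billions.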
For all of these reasons, high-bandwidth memory is agreed to be the memory of choice for AI/ML. “It’s giving you the maximum amount of bandwidth that you’re going to need for some of these training algorithms,” Rambus’ Ferro said. “And it’s configurable from the standpoint that you can have multiple memory stacks, which gives you very high bandwidth.”
This is why there is so much interest in HBM. “Most of our customers are AI customers,” Synopsys’ Murdock said. “They are making that one big fundamental tradeoff between an LPDDR5X interface and an HBM interface. The only thing that’s holding them back is cost. They really want to go to HBM. That’s their heart’s desire in terms of the technology, because you can’t touch the amount of bandwidth you can create around one SoC. Right now, we’re seeing six HBM stacks put around an SoC, which is just a tremendous amount of bandwidth.”
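That “tremendous amount of bandwidth” is easy to quantify. Assuming roughly 819 GB/s per HBM3-class stack, six stacks sum to nearly 5 TB/s of peak bandwidth around a single SoC:

```python
# Aggregate peak bandwidth from multiple HBM stacks around one SoC (illustrative).
PER_STACK_GB_S = 819  # ~HBM3 at 6.4 Gb/s per pin across a 1,024-bit interface
for stacks in (1, 4, 6):
    print(f"{stacks} stack(s): ~{stacks * PER_STACK_GB_S / 1000:.1f} TB/s")
```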
Nevertheless, AI demands are so high that HBM’s cutting-edge signature of reduced latency is suddenly looking dated and inadequate. That, in turn, is driving the push to the next generation of HBM.
“Latency is becoming a real issue,” said Ferro. “In the first two rounds of HBM, I didn’t hear anybody complain about latency. Now we’re getting questions about latency all the time.”
Given current constraints, it’s especially important to understand your data, Ferro advised. “It may be continuous data, like video or voice recognition. It could be transactional, like financial data, which can be very random. If you know the data is random, the way you set up a memory interface will be different than streaming a video. Those are basic questions, but it also goes deeper. What are the word sizes I’m going to use in my memory? What are the block sizes of the memory? The more you know about that, the more efficiently you can design your system. If you understand it, then you can customize the processor to maximize both the compute power and the memory bandwidth. We are seeing many more ASIC-style SoCs that are going after specific segments of the market for more efficient processing.”
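One concrete payoff of knowing your data is access granularity. If random requests are smaller than the burst the memory returns, effective bandwidth falls off accordingly. The 64-byte access size and the request sizes below are illustrative rather than a specific HBM configuration.

```python
# Effective-bandwidth estimate when request size differs from the memory access granularity.
def effective_bandwidth(peak_gb_s: float, request_bytes: int, access_bytes: int) -> float:
    """Scale peak bandwidth by the fraction of each fetched burst that is actually useful."""
    bursts = -(-request_bytes // access_bytes)           # ceiling division
    efficiency = request_bytes / (bursts * access_bytes)
    return peak_gb_s * efficiency

PEAK_GB_S = 819.0  # one HBM3-class stack, illustrative
for req_bytes in (16, 32, 64, 256):   # bytes per random access
    eff = effective_bandwidth(PEAK_GB_S, req_bytes, access_bytes=64)
    print(f"{req_bytes:4d} B requests: ~{eff:.0f} GB/s effective")
```

Streaming video-style traffic keeps the interface near its peak; small random transactions can throw most of it away, which is why the word-size and block-size questions matter.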
Making it cheaper (maybe)
While the classic HBM implementation uses a silicon interposer, there is hope for less costly solutions. “There are also approaches where you embed just a little piece of silicon in a standard package, so you don’t have a full silicon interposer that extends underneath everything,” Greenberg said. “You just have a bridge between the CPU and HBM. In addition, there are advances that are allowing finer pin pitches on standard package technology, which would reduce the cost significantly. There are also some proprietary solutions out there, where people are trying to connect a memory over high-speed SerDes-type connections, along the lines of UCIe, and potentially connecting memory over those. Right now, those solutions are proprietary, but I would look for those to become standardized.”
Greenberg said there may be parallel tracks of development: “The silicon interposer does provide the finest pin pitches or wire pitches possible — basically, the most bandwidth with the least energy — so silicon interposers will always be there. But if we can, as an industry, get together and decide on a memory standard that works on a standard package, that would have the potential of giving a similar bandwidth but at significantly less cost.”
There are ongoing attempts to reduce the cost for the next generation. “TSMC has announced they’ve got three different types of interposer,” Ferro said. “They’ve got an RDL interposer, they’ve got the silicon interposer, and they’ve got something that looks kind of like a hybrid of the two. There are other techniques, like how to get rid of the interposer altogether. You might see some prototypes come out in the next 12 to 18 months showing how to stack 3D memory on top, and that would theoretically get rid of the interposer. IBM actually has been doing that for years, but now it’s getting to the point where you don’t have to be an IBM to do this.”
The other way to solve the problem is to use less expensive materials. “There’s research into very fine-pitch organic materials, and whether they can be made fine enough to handle all those traces,” said Ferro. “In addition, UCIe is another way to connect chips through a more standard material to save cost. But again, you still have to solve the problem of many thousands of traces through these substrates.”
Murdock looks to economies of scale to cut costs. “The cost side will be somewhat alleviated as HBM grows in popularity. HBM, like any DRAM, is a commodity market at the end of the day. On the interposer side, I don’t see that dropping as quickly. That one still is going to be a bit of a challenge to overcome.”
But raw cost is not the only consideration. “It also comes down to how much bandwidth the SoC needs, and other costs such as board space, for example,” Murdock said. “LPDDR5X is a very popular alternative for folks who want a high-speed interface and need a lot of bandwidth, but the number of channels of LPDDR5X that are required to match an HBM stack is fairly substantial. You have a lot of device costs, and you have a lot of board space costs that may be prohibitive. Beyond just the dollars, there also could be physical limitations that move somebody to HBM, even though dollars-wise it is more expensive.”
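The board-space argument can be put in rough numbers. Assuming LPDDR5X at about 8.5 Gb/s per pin on a 32-bit channel, and an HBM3 stack at 6.4 Gb/s per pin across 1,024 bits, matching one stack takes on the order of two dozen LPDDR5X channels (illustrative figures only):

```python
# Roughly how many LPDDR5X channels match one HBM3 stack's peak bandwidth?
import math

hbm3_stack_gb_s = 6.4 * 1024 / 8    # ~819 GB/s per stack
lpddr5x_ch_gb_s = 8.533 * 32 / 8    # ~34 GB/s per 32-bit channel

channels = math.ceil(hbm3_stack_gb_s / lpddr5x_ch_gb_s)
print(f"~{channels} LPDDR5X x32 channels to equal one HBM3 stack")
```

Each of those channels brings its own devices, routing, and board area, which is the physical limitation Murdock points to.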
Others are not so sure about future cost reductions. “HBM costs would be a challenge to reduce,” said Jim Handy, principal analyst at Objective Analysis. “The processing cost is already significantly higher than that of standard DRAM because of the high cost of putting TSVs on the wafer. This prevents it from having a market as large as standard DRAM. Because the market is smaller, the economies of scale cause the costs to be even higher in a process that feeds on itself. The lower the volume, the higher the cost, yet the higher the cost, the less volume will be used. There’s no easy way around this.”
Nevertheless, Handy is upbeat about HBM’s future, noting that it still pencils out well compared to SRAM. “HBM is already a well-established JEDEC-standard product,” he said. “It’s a unique form of DRAM technology that provides extremely high bandwidth at considerably lower cost than SRAM. It can also be packaged to provide much higher densities than are available with SRAM. It will improve over time, just as DRAM has. As interfaces mature, expect to see more clever tricks that will increase its speed.”
Indeed, for all of the challenges, there’s definite cause for HBM optimism. “The standards are moving rapidly,” Ferro added. “If you look at the evolution of HBM these days, it’s roughly at a two-year cadence, which is really a phenomenal pace.”
Further reading
Choosing The Correct High-Bandwidth Memory
New applications require a deep understanding of the tradeoffs for different types of DRAM.
What’s Next For High Bandwidth Memory
Different approaches for breaking down the memory wall.