Improving Performance And Power With HBM3

New memory standard adds significant benefits, but it’s still expensive and complicated to use. That could change.


HBM3 swings open the door to significantly faster data movement between memory and processors, reducing the power it takes to send and receive signals and boosting the performance of systems where high data throughput is required. But using this memory is expensive and complicated, and that likely will continue to be the case in the short term.

High Bandwidth Memory 3 (HBM3) is the most recent addition to the HBM specification developed by JEDEC for stacking DRAM layers inside a single module. Introduced in January 2022, it is viewed as a major improvement for 2.5D packages. But HBM3 remains costly, in part because of the price of the memory itself, and in part because of the cost of other components such as silicon interposers and the engineering required to develop a 2.5D design. That has limited its use to the highest-volume designs, or to price-insensitive applications such as servers in data centers, where the cost of the memory can be offset by improved performance due to more and wider lanes for data, and by the lower power required to drive signals back and forth between processing elements and DRAM.

This helps explain why HBM3 has shown up first in NVIDIA's "Hopper" H100 enterprise GPU, with products from Intel and AMD not far behind. HBM3 offers several enhancements over HBM2E, most notably a near-doubling of the per-pin data rate, from 3.6 Gbps for HBM2E up to 6.4 Gbps for HBM3, or 819 GB/s of bandwidth per device.
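The 819 GB/s figure follows directly from HBM3's 1024-bit interface, which it inherits from earlier HBM generations. A quick back-of-the-envelope check (the helper function here is purely illustrative):

```python
# Per-device peak bandwidth for an HBM stack: bus width x per-pin data rate.
# HBM3 keeps the 1024-bit interface of HBM2E but raises the pin speed.

def hbm_bandwidth_gbps(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in gigabytes per second."""
    return bus_width_bits * pin_rate_gbps / 8  # 8 bits per byte

hbm2e = hbm_bandwidth_gbps(1024, 3.6)  # 460.8 GB/s
hbm3 = hbm_bandwidth_gbps(1024, 6.4)   # 819.2 GB/s
print(f"HBM2E: {hbm2e:.1f} GB/s, HBM3: {hbm3:.1f} GB/s")
```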

“Bandwidth is what’s needed to support bigger [compute] engines,” said Joe Macri, senior vice president and CTO of AMD’s Client PC Business. “If you look at a lot of the problems we’re solving, they are very bandwidth-heavy, whether it’s machine learning or HPC-type solutions. So even if we only chose to have a modest increase in engine size, we’d still benefit tremendously from an increase of bandwidth.”

In addition to increasing capacity and speed, the improvements in energy efficiency are noteworthy. With HBM3, the core voltage is 1.1V, compared to HBM2E’s 1.2V core voltage. HBM3 also reduces the I/O signaling to 400mV versus 1.2V for HBM2E. There will be further improvements in future generations, as well.
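As a first-order sanity check on those numbers, CMOS switching energy scales roughly with the square of voltage (E ≈ CV²). The sketch below applies that rule naively; real savings depend on capacitance, termination, and signaling scheme, so treat these ratios as illustrative only:

```python
# First-order estimate: CMOS switching energy scales as E ~ C * V^2.
# Holding capacitance and switching activity constant, lower rails and
# lower signal swing translate into quadratic energy savings.

def relative_energy(v_new: float, v_old: float) -> float:
    """Switching energy at v_new relative to v_old (1.0 = no change)."""
    return (v_new / v_old) ** 2

core = relative_energy(1.1, 1.2)  # ~0.84: roughly 16% less core switching energy
io = relative_energy(0.4, 1.2)    # ~0.11: roughly 9x less energy per I/O transition
print(f"core: {core:.2f}, I/O: {io:.2f}")
```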

“Once you get down into the 10nm era, you have to think about different scale-down technologies — things like high-K metal gate, for example — where we have to continuously increase our memory bandwidth,” said Jim Elliot, executive vice president for memory products at Samsung Semiconductor, at a recent presentation. “There is low-resistance material, because we have to push the limits of cell size for the DRAM components. And there are wide-bandgap materials, because we’re looking for 10X improvements in leakage, as well as finFETs for DRAM, which will allow us to continue to scale power beyond 0.9 volts.”

Fig. 1: Samsung’s new HBM3. Source: Samsung

None of this is easy, though. There will be significant challenges both in manufacturing this technology, and in fully utilizing it. Unlike in the past, when one advanced architecture could be leveraged across billions of units, many of these designs are bespoke. In the AI world, for example, pretty much everyone is building their own custom AI training chip and focusing on HBM. It’s being used in one of two ways — either as the only memory in the system, or with accompanying DRAM.

Fujitsu’s Arm-based A64FX is an example of the former. Used in Fugaku, for a time the world’s fastest supercomputer, the A64FX had 32GB of HBM2 on the die right next to the CPU, but no other system memory. Others, like AMD’s Instinct, Nvidia’s H100 GPU, and Intel’s CPU Max and GPU Max, have HBM accompanied by standard DRAM, where the HBM acts like a massive cache for the DRAM.

Enemy No. 1: Heat
The biggest challenge to using HBM is heat. It’s well understood that memory and heat don’t go together, but HBM3 is going to be used alongside the hottest chips and systems in the world. Nvidia’s H100, for example, has a 700-watt thermal design power (TDP) limit.

Macri said Frontier, the supercomputer at Oak Ridge National Laboratory — a mix of Epyc CPUs and Instinct GPUs (using HBM2E) — required AMD to do some creative load balancing to keep temperatures within limits.

Fig. 2: The Frontier supercomputer. Source: Oak Ridge National Laboratories

Some workloads on Frontier are memory-intensive and some are CPU-intensive, and balancing workloads to avoid overheating is done in silicon, not software. “There are microprocessors for which their whole job is to manage these control loops, keeping the system at the best possible place,” said Macri.

Frontier was built by HPE’s Cray division, in partnership with AMD, and load balancing to manage thermals is handled at the system design level. “We co-engineered the solution,” he said, “to be dynamically manipulated to yield the most performance, depending on the work that’s being done.”

There are hardware features within both the HBM and the controller that allow it to throttle the memory and put it into different performance states, or even shift to lower frequencies, said Frank Ferro, senior director of product management at Rambus. “If there’s starting to become a hotspot and you want to reduce the frequency or reduce the power and put the memory in idle modes, those are all fundamentally at the IP level and the DRAM level. At the DRAM level, you’ve got that capability, but how you use it is up to the system architects.”
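The control loop Ferro describes can be sketched in a few lines. This is a hypothetical illustration only: the state names, temperature thresholds, and step-up/step-down policy are invented for the example, not taken from any vendor's firmware or the HBM3 spec:

```python
# Hypothetical thermal control loop: a management microcontroller samples
# die temperature and steps the HBM stack through performance states.
# States, thresholds, and hysteresis values are illustrative assumptions.

PERF_STATES = ["full_speed", "reduced_frequency", "idle"]  # fastest -> coolest

def next_state(temp_c: float, state: str,
               throttle_at: float = 95.0, recover_at: float = 85.0) -> str:
    """Step one state hotter/cooler based on the latest temperature reading."""
    idx = PERF_STATES.index(state)
    if temp_c > throttle_at and idx < len(PERF_STATES) - 1:
        return PERF_STATES[idx + 1]   # hotspot forming: step down
    if temp_c < recover_at and idx > 0:
        return PERF_STATES[idx - 1]   # thermal headroom: step back up
    return state                      # hysteresis band: hold steady

state = "full_speed"
for reading in [80.0, 97.0, 99.0, 83.0, 82.0]:
    state = next_state(reading, state)
print(state)  # back at full_speed after the stack cools off
```

The hysteresis gap between the throttle and recovery thresholds prevents the loop from oscillating when the temperature hovers near a single cutoff.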

Density limits
The second thermal challenge facing HBM3 is from the memory itself. The HBM3 standard allows for as many as 16 layers, compared to the 12-layer limit of HBM2E. But Macri believes it will stop at 12 layers due to heat. Still, that could vary from one vendor to the next, and from one use case to the next for customized designs.

The bottom DRAM in the stack has the highest thermal impedance, which is the main limiter to stacking. HBM uses micro-bumps to connect the different DRAM dies, and micro-bumps have their shortcomings. With memory generating heat, that heat can build up at every level, and micro-bumps are not effective at transferring the heat out. That, in turn, limits the practical number of levels of DRAM. So even if HBM3 can support 16 layers, in most cases fewer layers will be utilized.

Each layer of DRAM needs its own power delivery, and that delivery must be robust enough to hit the target performance. Pushing that power delivery increases heat at every layer.
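A toy model shows why the bottom die limits stacking. If the stack is cooled from the top, heat from each die must cross every layer above it, so the bottom die's temperature rise grows quadratically with layer count. All values here (per-die power, per-layer thermal resistance, cold-plate temperature) are invented for illustration, not measured figures:

```python
# Toy series thermal-resistance model of a DRAM stack cooled from the top.
# Heat crossing each interface is the summed power of every die beneath it,
# so the total rise at the bottom die is R * P * n * (n + 1) / 2.
# All parameter values are illustrative assumptions.

def bottom_die_temp(layers: int, p_w: float = 0.5,
                    r_cpw: float = 2.0, t_cold: float = 40.0) -> float:
    """Bottom-die temperature (C) when all heat exits through the top."""
    return t_cold + r_cpw * p_w * layers * (layers + 1) / 2

for n in (8, 12, 16):
    print(f"{n} layers -> bottom die at {bottom_die_temp(n):.0f} C")
```

Even with made-up numbers, the quadratic growth makes the point: going from 12 to 16 layers adds far more thermal rise than going from 8 to 12, which is consistent with Macri's expectation that stacks will stop at 12 layers.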

2.5D for now
HBM has remained a 2.5D technology largely because of heat. The 2.5D design is why the memory sits beside the processor. In a true 3D design, the memory would sit on top of the CPU/GPU and talk directly to the chip. With CPUs topping out at 300 watts and GPUs hitting 700 watts, the heat is too much.

“The challenge is if you are generating a lot of heat, you’re stepping on top of micro-bumps that really don’t do a good job transferring the heat out. And so that’s why pretty much everyone has done 2.5D, because the micro-bump technology really limits the amount of power you can put into the die underneath it,” said Macri.

Some of that will change with full 3D-IC implementations. “This physical layer would get less complicated if you are 3D, so there are a lot of advantages,” said Ferro. “You get rid of the interposer. The physical interface between the chips becomes less complicated because you’re not traveling through another medium to connect. So there are a lot of advantages, but there also are a lot of challenges.”

For example, cooling a 3D-IC is difficult using existing technology because the memory sitting on top of the chip literally insulates the ASIC or GPU underneath it. In a planar SoC, that heat is dissipated by the silicon itself. But in 3D-ICs, more elaborate approaches need to be utilized, in part because the heat can get trapped between layers, and in part because the thinned dies used in those devices cannot dissipate as much heat.

“The moment you put a memory die stack on top of the GPU, the GPU heat needs to go through the memory before it gets dissipated, or before it hits the cold plate. So you’ve got a different challenge now all of a sudden,” said Girish Cherussery, senior director of HBM product management at Micron Technologies. “I don’t think I’m going to see something which takes the existing HBM and stacks it directly on top of a GPU or an ASIC that’s burning 400, 500 watts of power. But will it happen in the future? This is a solution that can come to fruition.”

Dunking chips
This is part of a bigger challenge: how to keep data centers cool and power-efficient. Thermal solutions are one of the bottlenecks to keeping the environment sustainable. “And immersion cooling seems to be one of those solutions that the industry is looking at,” noted Cherussery.

Immersion cooling may be the only real solution, because it doesn’t use a cold plate like air and liquid cooling do. It involves immersing the motherboard, complete with CPU and memory, in a non-conductive dielectric fluid – oftentimes mineral oil – with just the NIC, USB, and other ports sticking out of the fluid.

This is especially important in data centers, where cooling racks of servers can cost millions of dollars annually. The average power usage effectiveness (PUE) rating for American data centers is about 1.5. The lower the score, the more efficient the data center, and it cannot go below 1.0. Everything above 1.0 is power spent on cooling and other overhead rather than on computing, so at a PUE of 1.5, the facility draws half as much power again as its IT equipment consumes (roughly a third of the total) just for cooling and overhead.
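The arithmetic is worth making explicit, since PUE is a ratio of total facility power to IT power, not a percentage. The overhead fraction of total power works out to (PUE − 1) / PUE:

```python
# PUE = total facility power / IT equipment power.
# The fraction of TOTAL power going to cooling and other overhead
# is therefore (PUE - 1) / PUE.

def overhead_fraction(pue: float) -> float:
    """Share of total facility power consumed by non-IT overhead."""
    return (pue - 1.0) / pue

for pue in (1.5, 1.1, 1.01):
    print(f"PUE {pue}: {overhead_fraction(pue):.1%} of total power is overhead")
```

At a PUE of 1.5, about 33% of total power is overhead; at 1.01, about 1%, which matches the Hong Kong figure cited below.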

Immersion can be tremendously efficient. A Hong Kong data center achieved a PUE rating of 1.01. Liquid-cooled data centers have gotten down to the 1.1 range, but 1.01 was unheard of. That data center was spending just 1% of its total power on cooling.

Immersion cooling has long been on the fringe of cooling techniques and only used in extreme cases, but it is slowly going mainstream. The company behind the Hong Kong data center, LiquidStack, has gotten some VC funding, and Microsoft has documented its experiments with LiquidStack products at one data center. Microsoft gained power use efficiency, but it also found it could overclock the chips without damaging them. So there is a very real possibility that a future path for true 3D stacking of HBM might be through a tank full of mineral oil.

Variations among vendors
Macri noted that DRAM vendors are competing against each other, just like SoC vendors, and that means some HBM is lower power and some higher power. “There’s good competition everywhere. That’s important, because it drives innovation,” he said.

That wiggle room can also lead to problems. There is no standard when it comes to specifying power, he said. Each DRAM maker is coming up with the best way to design the memory to achieve the best end result, with power and price as the key variables.

“The better stuff costs more than stuff that isn’t quite as good and that’s important also, because there are different system targets, depending on the company and what they’re using it for,” said Macri.

However, the DRAM itself does conform to JEDEC standards. So in theory, you should be able to take a DRAM from one vendor and replace it with another, which keeps those differences within bounds.

“Is there a lot of overlap and similarity for what we do? Absolutely,” said Ferro. “Are they exactly the same? Maybe it’s a little different, but it’s not fundamentally different. You have to go through the process with each of the vendors, because there might be a little bit of variation.”

The testability and RAS (reliability, availability, serviceability) capabilities have improved significantly since HBM2E. HBM3 also adds on-die ECC within the DRAM to make the DRAM itself highly reliable. “That’s very, very important, because any error that’s generated requires you to do a return or fix it up, which adds latency,” he said.

Other challenges
Because HBM is tied to 2.5D for the time being, that adds a size limitation for memory. The size of the SoC, combined with the number of HBM chips, adds up to a larger area to cool.

“That’s another challenge we deal with,” said Ferro. “You just can’t get bigger because things will just fall off the package. So we have to pay a lot of attention to making sure our aspect ratio was correct, and that we don’t exceed any of these size limits.”

In using HBM, you want to take advantage of its biggest attribute, which is bandwidth. But designing to take advantage of that bandwidth isn’t easy. “You want very dense floating point units, and that’s challenging,” said Macri. “DRAM doesn’t love random access. So you want to design your reference pad such that it’s very friendly to the HBM. You extract the most efficiency out of it, and that’s very difficult.”

HBM3 offers several improvements over the HBM2E standard. Some were expected (bandwidth bump), some unexpected (RAS improvements, updated clocking methodology). All told, the new standard offers users a significant improvement to HBM memory for the next generation of SoCs. But so far, at least, it’s not a plug-and-play solution.
