New levels of system performance bring new tradeoffs.
An insatiable demand for bandwidth in everything from high-performance computing to AI training, gaming, and automotive applications is fueling the development of the next generation of high-bandwidth memory.
HBM3 will bring a 2X bump in bandwidth and capacity per stack, as well as some other benefits. What was once considered a “slow and wide” memory technology to reduce signal traffic delays in off-chip memory is becoming significantly faster and wider. In some cases, it is even being used for L4 cache.
“These new capabilities will enable the next level of energy efficiency in terms of joules per transferred bit and more designs with an HBM3-only memory solution with no additional off-package memory,” said Alejandro Rico, principal research engineer at Arm. “Applications in AI/ML, HPC, and data analytics can take advantage of the additional bandwidth to keep scaling performance. The proper exploitation of HBM3 bandwidth requires a balanced processor design with high-bandwidth on-chip networks and processing elements tuned to maximize data rates with increased levels of memory level parallelism.”
AI training chips typically need on the order of a terabyte per second of raw data bandwidth, and HBM3 is getting up to that level, noted Frank Ferro, senior director, product marketing, IP cores at Rambus. “Users are pushing for more bandwidth as they develop ASICs to build a better mousetrap to solve the AI problem. Everybody’s trying to come up with a more efficient processor to implement their specific neural networks and implement those more efficiently with better memory utilization, better CPU utilization. For AI training, HBM has been the choice because it gives the most bandwidth, the best power, and the best footprint. It’s a little more expensive, but for these applications — especially going up into the cloud — they can afford it. There’s really no real barrier there, especially with multiple HBMs on an interposer. HBM3 is really just a natural migration.”
Fig. 1: I/O speed of different HBM versions. Source: Rambus/SK Hynix
While JEDEC has not released details on the yet-to-be-ratified HBM3 specification, Rambus reports its HBM3 memory subsystem will support data rates of up to 8.4 Gbps per pin, compared with 3.6 Gbps for HBM2E. Products that implement HBM3 are expected to ship in early 2023.
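As a rough sanity check, every HBM generation to date has used a 1,024-bit interface per stack, so those per-pin rates translate directly into per-stack bandwidth. The short sketch below shows the arithmetic; the final HBM3 numbers will depend on the ratified JEDEC specification, and the 8.4 Gbps figure is the Rambus subsystem claim cited above.

```python
# Back-of-the-envelope conversion from per-pin data rate to per-stack bandwidth.
# Assumes the 1,024-bit stack interface used by HBM generations to date.

INTERFACE_WIDTH_BITS = 1024  # bits per HBM stack

def stack_bandwidth_gbs(pin_rate_gbps: float) -> float:
    """Aggregate bandwidth of one stack in GB/s for a given per-pin rate."""
    return pin_rate_gbps * INTERFACE_WIDTH_BITS / 8

for name, rate in [("HBM2E", 3.6), ("HBM3 (Rambus subsystem)", 8.4)]:
    print(f"{name}: {rate} Gbps/pin -> {stack_bandwidth_gbs(rate):.0f} GB/s per stack")
# HBM2E: 3.6 Gbps/pin -> 461 GB/s per stack
# HBM3 (Rambus subsystem): 8.4 Gbps/pin -> 1075 GB/s per stack
```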
“HBM3 is beneficial where the key performance indicator for the chip is memory bandwidth per watt, or if HBM3 is the only way to achieve the required bandwidth,” said Marc Greenberg, group director of product marketing, IP group at Cadence. “That bandwidth and efficiency come at a cost of additional silicon in the system compared to PCB-based approaches like DDR5, LPDDR5/5X, or GDDR6, and potentially higher manufacturing/assembly/inventory costs. The additional silicon is usually an interposer, as well as a base die underneath each HBM3 DRAM stack.”
Why this matters
In the decade since HBM was first announced, 2.5 generations of the standard have come to market. During that time, the amount of data created, captured, copied, and consumed increased from 2 zettabytes in 2010 to 64.2 ZB in 2020, according to Statista, which forecasts that number will grow by nearly three times to 181 ZB in 2025.
“In 2016, HBM2 doubled the signaling rate to 2 Gbps and the bandwidth to 256 GB/s,” said Anika Malhotra, senior product marketing manager at Synopsys. “Two years later, HBM2E came on the scene and ultimately achieved data rates of 3.6 Gbps and 460 GB/s. Performance hunger is increasing, and so are the unrelenting bandwidth demands of advanced workloads, because higher memory bandwidth is, and will continue to be, a critical enabler of computing performance.”
Alongside that, chip designs are becoming more complex in order to process all of this data more quickly, often with specialized accelerators and on-chip and in-package memories and interfaces. HBM increasingly is viewed as a way of pushing heterogeneous distributed processing to a completely different level, she said.
“Originally, high-bandwidth memory was seen by the graphics companies as a clear step in the evolutionary direction, but then the networking and data center community realized HBM could add a new tier of memory in their memory hierarchy for more bandwidth, and all the things that are driving the data center — lower latency, faster access, less delay, lower power,” Malhotra said. “Typically, CPUs are optimized for capacity, while accelerators and GPUs are optimized for bandwidth. However, with the exponentially growing model sizes, we see constant demand for both capacity and bandwidth without tradeoffs. We are seeing more memory tiering, which includes support for software-visible HBM plus DDR, and software transparent caching that uses HBM as a DDR-backed cache. Beyond CPUs and GPUs, HBM is also popular for data center FPGAs.”
HBM originally was intended as a replacement for other memories like GDDR, driven by some of the leading semiconductor companies, particularly Nvidia and AMD. Those companies are still heavily involved in driving its evolution in the JEDEC Task Group, where Nvidia is the chair and AMD is one of the main contributors.
For GPUs, there are two options today, said Brett Murdock, product marketing manager, memory interface IP at Synopsys. “One option still uses GDDR, with a boatload of devices around the SoC. The other option is to use the HBM instead. With HBM, you’ll get more bandwidth and fewer physical interfaces to deal with. The tradeoff is higher cost overall. Another advantage is that with fewer physical interfaces, there is lower power consumption. GDDR is a very power-hungry interface, but HBM is a super power-efficient interface. So at the end of the day, the real question the customer is asking is, ‘Where’s the priority for me spending my dollars?’ With HBM3, that is really going to start tipping the balance toward, ‘maybe I want to spend those dollars on HBM.’”
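To put that “boatload of devices” in perspective, the sketch below counts how many memory devices each approach would need to reach a hypothetical 1 TB/s budget. The specific figures are illustrative assumptions, not from the article: a typical 32-bit GDDR6 device at 16 Gbps per pin, versus one HBM3 stack at the 8.4 Gbps per-pin rate cited earlier.

```python
# Rough device count needed to hit a target bandwidth budget.
# Illustrative assumptions: 32-bit GDDR6 device at 16 Gbps/pin (64 GB/s each)
# versus one HBM3 stack at 8.4 Gbps/pin over 1,024 bits (~1,075 GB/s).
import math

TARGET_GBS = 1000  # hypothetical 1 TB/s accelerator budget

gddr6_per_device = 16 * 32 / 8    # 64 GB/s per GDDR6 device
hbm3_per_stack = 8.4 * 1024 / 8   # ~1,075 GB/s per HBM3 stack

print("GDDR6 devices needed:", math.ceil(TARGET_GBS / gddr6_per_device))  # 16
print("HBM3 stacks needed:  ", math.ceil(TARGET_GBS / hbm3_per_stack))    # 1
```

Sixteen GDDR6 devices means sixteen PHYs and board routes around the SoC, which is the power and interface-count tradeoff Murdock describes.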
Although relegated to certain niches when it originally was introduced, with AMD and Nvidia as the only users, HBM2/2E now has a very large installed base of users. That base is expected to widen considerably once HBM3 is ratified by JEDEC.
Critical tradeoffs
Chipmakers have made it clear that HBM3 makes sense when there is an interposer in the system, such as a chiplet-based design that already was using the silicon interposer for that reason. “However, in many cases where there is not already an interposer in the system, a memory-on-PCB solution like GDDR6, LPDDR5/5X, or DDR5 may be more cost-effective than the addition of an interposer expressly to be able to implement HBM3,” said Cadence’s Greenberg.
As economies of scale take effect, however, those kinds of tradeoffs may become less of an issue. Synopsys’ Murdock said the biggest consideration for HBM3 users is managing PPA, because for the same bandwidth an HBM device requires less silicon area, consumes less power, and presents fewer physical interfaces than GDDR.
“Also, with an HBM device on the IP side of things, compared with DDR, GDDR, or LPDDR interfaces, how you implement them physically on the SoC is the Wild West,” Murdock said. “You can put a full linear PHY on the side of the die. You can wrap around a corner. You can fold it in on itself. There’s an untold number of ways you can implement that physical interface. But with HBM, when you’re dropping down one HBM cube, JEDEC has defined exactly what the bump map on that cube looks like. Users will place that on the interposer, and it will sit right next to the SoC, so really there’s only one viable option for how to build the bump map on the SoC — that’s to match the HBM device. And that drives the physical implementation of the PHY.”
These decisions also impact reliability. While there may be less flexibility as far as where the bumps go, that increased predictability can mean higher reliability, as well.
“There are a few different choices in the interposer for how to connect those things together, but at the end of the day, if I look at GDDR, LPDDR, or DDR, I can build a million different boards and connect them in a million different ways,” he said. “This results in a million different implementations, and a million different opportunities for somebody to mess something up. With HBM, you put in the PHY, you put in the device, and the interposer between those two is straightforward. That interposer connection is going to look the same for Nvidia as it does for AMD, Intel, or whoever else is going to do it. There’s not a lot of variability for how you’re going to do that other than some minimum spacing rules between your SoC and your HBM device. That’s pretty much it. This should lead to work with the tools teams for 3D IC to enable quick routing of the interposer between the two devices, because there can’t really be a whole ton of variability in how you’re going to do that.”
Another contributing factor to reliability is how many times something has been done. “The fact that we’re doing the same thing for every customer, or almost the same thing, means we’re really good at it, and it’s tried and true. I know it’s worked for AMD and millions of units that they ship, so why would it be any different for this new AI customer that we’re selling HBM to for the first time? We’re not going to have to reinvent anything,” Murdock said.
Especially with the complexities that 2.5D and 3D bring, the more variables that can be eliminated, the better.
Not surprisingly, power management is the top consideration in AI/ML applications where strong adoption is expected for HBM3, Synopsys’ Malhotra said. “This is true for the data center as well as for edge devices. Tradeoffs revolve around power, performance, area, and bandwidth. For edge computing, the tradeoffs continue to grow in complexity, with a fourth variable added to the traditional PPA equation — bandwidth. In a processor design or accelerator design for AI/ML, in figuring out the power, performance, area, bandwidth tradeoff, a lot depends upon the nature of the workload.”
Making sure it works
While HBM3 implementations may seem straightforward enough, nothing is simple. And because these memories are often used in mission-critical applications, ensuring they work as expected requires additional work. Joe Rodriguez, senior product marketing engineer for IP cores at Rambus, said the post-silicon debug and hardware bring-up tools that many vendors provide should be used to make sure the entire memory subsystem is performing as it should.
Users typically leverage the testbench and simulation environment provided by the vendor, taking the controller and running simulations to see how well the system will perform with HBM2E/HBM3 memory.
“While looking at the overall system efficiency, the physical implementation always has been a challenge with HBM because you have such a small area,” Rambus’ Ferro said. “That’s a good thing, but now you’ve got a CPU or a GPU, you’ve got maybe four or more HBM DRAMs, and you’ve got that on a very small footprint. This means the heat, the power, the signal integrity, the manufacturing reliability are all issues that have to be worked on in the physical design implementation.”
Fig. 2: 2.5D/3D system architecture with HBM3 memory. Source: Rambus
In order to get the most performance out of the interposer and package design, even at HBM2E speeds of 3.2 and 3.6 Gbps, a lot of companies have struggled to get good signal integrity through the interposer. To complicate matters, each foundry has different design rules for those interposers, some more challenging than others.
“With HBM3, they’re increasing the number of layers, and increasing the capabilities of the interposer — the dielectric thickness, etc. — to make that problem a little easier to solve,” Ferro said. “But even in the previous generation, a lot of customers are scratching their head saying, ‘How do you get this thing to run at 3.2 gigabits per second?’”
Conclusion
The push toward higher memory bandwidth will continue for the foreseeable future, and the upcoming introduction of HBM3 is expected to open a new phase in system design, taking system performance to a new level.
To achieve this, industry players must continue to address the design and verification requirements of data-intensive SoCs with memory interface IP and verification solutions for the most advanced protocols like HBM3. Taken as a whole, these solutions need to be stitched together so that specification compliance can be verified with protocol and timing checkers, along with coverage models that help ensure no bugs escape.
Related
HBM Knowledge Center
Top stories, videos, special reports, white papers and more on HBM
HBM Takes On A Much Bigger Role
High-bandwidth memory may be a significant gateway technology that allows the industry to make a controlled transition to true 3D design and assembly.
Architecting Interposers
It’s not easy to include interposers in a design today, but as the wrinkles get ironed out, new tools, methodologies, and standards will enable it for the masses.
Will Monolithic 3D DRAM Happen?
New and faster memory designs are being developed, but their future is uncertain.