HBM Takes On A Much Bigger Role

High-bandwidth memory may be a significant gateway technology that allows the industry to make a controlled transition to true 3D design and assembly.

May 13th, 2021 - By: Brian Bailey

High-bandwidth memory is getting faster and showing up in more designs, but this stacked DRAM technology may play a much bigger role as a gateway for both chiplet-based SoCs and true 3D designs.

HBM increasingly is being viewed as a way of pushing heterogeneous distributed processing to a completely different level. Once viewed as an expensive technology that could be utilized only in the highest-value designs, its application is broadening. It has progressed to the point where some people in the industry see it as the best option to move the industry into the next stage of system development. In fact, betting against HBM probably is not a wise move.

Originally released in 2013, HBM was a new memory interface that utilized stacked SDRAM connected to a processor via a silicon interposer. The connections between the memory and the SoC are made by metal layers on that interposer. It can be thought of either as a large chip itself, where the other die are flipped over and connected by microbumps, or as a PCB within the package. This is often called 2.5D integration.

That may be just the first step, though. A silicon interposer is capable of having active circuitry built into it, even if it is unlikely to be on an advanced node. By using the interposer as more than passive interconnect, designs would become free to utilize the Z axis for true 3D integration.

It took a couple of years before the first example of HBM was in production usage, and by 2016 the second generation of the interface was released. It saw much faster adoption. In 2018, it received another speed and capacity boost (HBM2e), and the first examples of HBM3 have been announced, even though the standard is not yet complete.

Graham Allan, senior manager for product marketing at Synopsys, explains the performance of the technology today. “HBM2 has gone through a couple of performance bumps. It started at 2.4 Gb/s per pin bandwidth. That got bumped up to 3.2 and then to 3.6. If you look at any metric for the performance of HBM compared to an off-chip memory subsystem — including bandwidth, performance efficiency (expressed in picoJoules per bit), the amount of beachfront required on an SoC so that you don’t become limited, and the total area taken on your SoC — HBM wins every time. The only place where HBM does not currently win is in terms of overall product cost, and that is because HBM is a 2.5D interposer-based technology.”

Various DRAM interfaces exist today for connecting memory to processors (see figure 1) and each of them excels in certain areas.

Fig. 1: Comparison of various memory interfaces. Source: Synopsys

HBM’s roadmap is aggressive. “You’re going to see HBM3 doubling capacity, doubling speed versus where we are today,” says Synopsys’ Allan. “We can make similar projections for other technologies (see figure 2). This demonstrates the runway for HBM. It just gets further and further out in front of the pack.”

Fig. 2: Projections for future memory interfaces. Source: Synopsys

This is in line with other people’s views. “The second generation of high-bandwidth memory (HBM2E) is currently available and can achieve bandwidths of at least 410GB/s per package, with a 3.2Gbps data rate and wide (1024-bit) data bus,” says Wendy Elsasser, distinguished engineer at Arm. “Work is ongoing to develop HBM3, the next generation of high bandwidth memory, and one can expect this technology to have increased bandwidth capability with a traditional uptick of 2X between DRAM generations.”
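The 410GB/s figure can be reproduced from the two numbers in the quote: the per-pin data rate and the 1024-bit bus width. The Python sketch below shows the arithmetic for the three pin rates mentioned so far. The function name `peak_bandwidth_gbytes_per_s` is illustrative only, and the result is a theoretical peak per stack, not a sustained figure.

```python
def peak_bandwidth_gbytes_per_s(pin_rate_gbps: float, bus_width_bits: int = 1024) -> float:
    """Theoretical peak bandwidth of one HBM stack, in GB/s.

    pin_rate_gbps  -- data rate per pin in Gb/s (values quoted in the article)
    bus_width_bits -- data bus width; 1024 bits is the width quoted for HBM2E
    """
    # Bits per second across the whole bus, divided by 8 to convert to bytes.
    return pin_rate_gbps * bus_width_bits / 8


# Pin rates quoted by Synopsys for successive HBM2 speed bumps.
for rate in (2.4, 3.2, 3.6):
    bandwidth = peak_bandwidth_gbytes_per_s(rate)
    print(f"{rate} Gb/s per pin -> {bandwidth:.1f} GB/s per stack")
```

At 3.2Gb/s per pin this gives 409.6GB/s, which rounds to the 410GB/s quoted by Arm.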

Still, it will take time for economies of scale to kick in. “The DRAMs themselves are smaller volumes than the mainstream DDR5 and LPDDR5, and the cost of DRAMs are always directly proportional to volume. Until the volumes ramp up more significantly, they will tend to be more expensive than DDR5 or LPDDR5, but that price gap is narrowing significantly every quarter.”

Application areas
HBM was originally developed for graphics processors, but its application has spread. “There is a lot of HBM usage today in devices of all kinds,” says Mike Thompson, senior product line manager at Xilinx.

“When customers need integrated HBM they really, really need it. There is usually some critical problem they are trying to solve. Probably the top three are data center applications with compute acceleration, machine learning (ML) data pre-processing and buffering, and database acceleration and analytics. In addition, I also see a lot of pull from wired communications, for security appliances — especially next generation firewalls — search-and-look-up applications for network filtering or load balancing, routing applications, and 400 gig switches and routers. Test and measurement is using it for network testers, packet capture, and data-capture type of equipment.”

General compute also is adopting it. “You can treat HBM as an L4 cache to the processor, in between the DDR interface and the processor,” says Allan. “We are starting to see customers looking for both DDR5 and HBM interfaces on the same SoC. DDR5 is used where you need capacity, providing 4TB of potential storage in one blade. With HBM, realistically, you could get six HBM die around an SoC, where each of those is 32GB.”
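The capacity gap between the two tiers is easy to quantify from the numbers in the quote. This is a rough sketch only: the constant names are illustrative, and it assumes 1TB = 1024GB.

```python
# Figures quoted in the article.
DDR5_CAPACITY_GB = 4 * 1024    # 4TB of DDR5 per blade (assuming 1TB = 1024GB)
HBM_STACKS = 6                 # HBM devices placed around one SoC
HBM_STACK_CAPACITY_GB = 32     # capacity of each HBM device

# Total in-package HBM capacity and how it compares with the DDR5 tier.
hbm_total_gb = HBM_STACKS * HBM_STACK_CAPACITY_GB
capacity_ratio = DDR5_CAPACITY_GB / hbm_total_gb

print(f"In-package HBM capacity: {hbm_total_gb} GB")
print(f"DDR5 tier is about {capacity_ratio:.1f}x larger")
```

With these figures the HBM tier holds 192GB, roughly one twentieth of the DDR5 capacity, which is consistent with using it as a cache in front of DDR5 rather than as a replacement for it.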

As the potential for the technology grows, different markets may want to focus on various aspects. “Today 8GB or 16GB HBM 2.0 using 1xnm process technology is very popular,” says Jacky Tseng, DRAM marketing department manager for Winbond Electronics. “But there are some applications that would like to see lower density, like 1GB or 2GB. These applications are hungry for bandwidth, and they will be able to utilize 100GB/s in the future.”

Other applications need more storage. “The benefit of HBM is really its high bandwidth,” says Michael Frank, fellow and system architect at Arteris IP. “If you have a working data set that fits, it’s fine. To consume that much bandwidth, you are likely to use a decent amount of silicon area to process it. But HBM does not provide the low latency that you get from SRAMs. You have to look at your application. What is your algorithm? In many systems, you sequentially process a lot of data, mostly with the same kind of processing scheme. It’s like SIMD or streaming. Machine learning is typically something like this, where you have large data sets and weights. But HBM is still limited in capacity, and the price is relatively high.”

The next generation
While the HBM3 standard has not yet been finalized, the industry already is moving ahead with pre-standard versions of it. One example is OpenFive, which recently announced an SoC that features an HBM3 subsystem. Its HBM3 interface is specified to be 7.2Gbps.
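To put the 7.2Gbps figure in context, the sketch below compares it with the 3.2Gbps HBM2E rate quoted earlier. The 1024-bit bus width for the pre-standard HBM3 interface is an assumption carried over from HBM2E, not something stated for this product, and the names are illustrative.

```python
BUS_WIDTH_BITS = 1024            # assumption: same bus width as HBM2E
HBM2E_PIN_RATE_GBPS = 3.2        # HBM2E rate quoted earlier in the article
HBM3_PRE_STD_PIN_RATE_GBPS = 7.2 # pre-standard HBM3 rate quoted for OpenFive


def stack_bandwidth_gbytes_per_s(pin_rate_gbps: float) -> float:
    """Theoretical peak GB/s for one stack at the assumed bus width."""
    return pin_rate_gbps * BUS_WIDTH_BITS / 8


hbm3_bandwidth = stack_bandwidth_gbytes_per_s(HBM3_PRE_STD_PIN_RATE_GBPS)
speedup = HBM3_PRE_STD_PIN_RATE_GBPS / HBM2E_PIN_RATE_GBPS

print(f"Pre-standard HBM3: {hbm3_bandwidth:.1f} GB/s per stack")
print(f"Speedup over 3.2 Gb/s HBM2E: {speedup:.2f}x")
```

Under that assumption the interface would peak at about 921.6GB/s per stack, a 2.25X step that is in line with the generational doubling described above.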

“Achieving power neutrality must be considered for the development of the HBM3, in addition to data probity to ensure reliable data and fault detection capabilities,” says Arm’s Elsasser. “A final consideration is increased capacity – a combination of per-die capacity and number of dies stacked per package, which are constrained by process, thermals, and package height.”

Thermal is becoming a big issue. “Thermal constraints are one of the things that are driving HBM adoption in the first place,” says Xilinx’s Thompson. “If you compare the power consumption to external commodity memories, HBM solutions run at 75% lower power. Compared to the external memories, HBM takes a lot of the power out of the I/O, which traditionally is where memory consumed a lot of their power — not just in the die themselves, but in the I/O.”
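The link between energy per bit and interface power can be sketched in a few lines. The picojoule-per-bit values here are placeholders chosen only so that the ratio matches the 75% saving quoted above. They are not measured or published figures, and the function name is illustrative.

```python
def interface_power_watts(bandwidth_gbytes_per_s: float, energy_pj_per_bit: float) -> float:
    """Power needed to move data at a given bandwidth and energy per bit.

    GB/s * 8 gives Gb/s (1e9 bits/s); multiplying by pJ/bit (1e-12 J)
    leaves a factor of 1e-3, hence the division by 1000 to get watts.
    """
    return bandwidth_gbytes_per_s * 8 * energy_pj_per_bit / 1000


BANDWIDTH_GBYTES_PER_S = 409.6   # one HBM2E stack at 3.2 Gb/s x 1024 bits
EXTERNAL_PJ_PER_BIT = 16.0       # hypothetical placeholder, not a real figure
HBM_PJ_PER_BIT = 4.0             # hypothetical placeholder, 75% lower

external_watts = interface_power_watts(BANDWIDTH_GBYTES_PER_S, EXTERNAL_PJ_PER_BIT)
hbm_watts = interface_power_watts(BANDWIDTH_GBYTES_PER_S, HBM_PJ_PER_BIT)
saving = 1 - hbm_watts / external_watts

print(f"External memory: {external_watts:.1f} W, HBM: {hbm_watts:.1f} W")
print(f"Saving: {saving:.0%}")
```

Because power scales linearly with both bandwidth and energy per bit, a fixed percentage saving in picojoules per bit translates into more watts saved as bandwidth rises.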

But the total heat of the package has to be considered. “Things get warm,” says Arteris’ Frank. “There is nothing that DRAMs hate more than heat, except possibly alpha particles. Heat is the biggest enemy they have. What we call Refresh is because of leakage, and this increases when things get hotter.”

This is driving advances in packaging. “Whereas 10 years ago we would have lidded flip-chip packages across the board (figure 3a), now we’re seeing lidless packages with stiffener rings and co-planar dies integrated internally (figure 3b),” says Thompson. “A lidless flip-chip package means that there’s no lid on the device, and the silicon is directly exposed. That makes it possible to get the heat sink in direct contact with the silicon itself. In many cases we see 10 to 20 degrees Celsius lower junction temperature because of the higher efficiency of the heat sink. Co-planarity also matters. This means the tops of the silicon are all in one plane. They are all at the same height. If the dies within a lidless package are not at the same height, that means that it is either challenging, or impossible, to get a heat sink to make contact with the dies themselves, hence reducing thermal dissipation.”

Fig. 3a: Traditional packaging technology. Source: Xilinx

Fig. 3b: Lidless packaging technology. Source: Xilinx

Looking ahead
HBM is starting to become more than just a memory interface. Increasingly, it is an enabler and test vehicle for some advancements that could transform the industry.

HBM is essentially a successful demonstration of chiplets. It also is being used to demonstrate the ability to enable processing in memory, and it is enabling levels of integration that make whole new classes of products available.

“HBM3 technology is just chiplet technology done a different way,” says Allan. “One of the chiplets is a memory, and the other chiplet is the processor die. A lot of people are choosing to use interposers for both their chip-to-chip communication, as well as their communication to external DRAM, because then you get a two for one benefit. Once you have an interposer, you can have multiple die fulfilling the compute requirements, and multiple die for the memory requirements, all in the same package, and at no additional cost.”

HBM itself is a 3D stack of die. “HBM3 seems to be focused on true 3D assembly,” says Thompson. “That will help address the area scaling of devices that we are able to fabricate in the industry as a whole. When we’re making interposers and dies that are hitting reticle limits, there is certainly a challenge, but that is driven because die and interposers are fundamentally 2D solutions. With HBM3, we’ll be able to jump into true 3D solutions, and so significantly mitigate hitting reticle limits or die size and interposer limits.”

Another perhaps unexpected twist came from Samsung recently. The company integrated Processing-in-Memory (PIM) with High Bandwidth Memory (HBM). PIM is able to process some of the logic functions by, in this case, integrating an AI engine — which Samsung calls a Programmable Computing Unit (PCU) — into the memory core. According to Samsung, PIM in AI applications, such as speech recognition, has shown a 2X increase in performance compared to existing HBM and reduced energy consumption by 70%.

Samsung believes this will stimulate growth in the use of AI applications that require continuous performance improvements, such as mobile, data centers and HPC.

This is made possible by the physical way that HBM memory stacks are created. “HBM usually has a base die that has the controller and some other interface logic, and then come the stacks of DRAM on top of it,” says Frank. “If you have an application that can take advantage of it, building a stack that has some processing on the bottom layer makes some sense. Basically, instead of your controller die, that is the interface for your system. You build a processing die as the first layer. This is a bold move because you have to build enough of those to justify the effort, but this is a way to get even better bandwidth because you’re not relying on the HBM interface that goes out to your logic die, and you don’t waste that power for moving the data that you are processing directly within the stack. Having said that, if you’re processing enough, you meet your friend: power.”

Attempts to create near- or in-memory computing have been around for a long time, but HBM may provide the platform for successful adoption. “That direction does make sense,” says Thompson. “I see a future with massive amounts of AI processing, massive amounts of memory, massive amounts of adaptable compute logic fabric, as well as massive amounts of high-speed connectivity converging.”

But there are other challenges associated with PIM. “It can be easier to move the processing to the data, especially if you have huge quantities of data,” says Steven Woo, fellow and distinguished inventor at Rambus. “But the challenge is that it has to make business sense for folks who want to try to adopt it. It also has to make technical sense in terms of being able to either rewrite applications or convert applications that you have today. It is not necessarily as straightforward to take existing applications and translate them to work on a new kind of architecture.”

Thinking outside the box
Looking at HBM as just a DRAM interface may be way too limiting. “You could build something like an HBM stack made with MRAM or ReRAM,” says Frank. “That could be an interesting product, because technically you have built a complete disk from that. If the memory is non-volatile, you can build a complete stack where the lowest level is the logic, and then you stack the memory layers on top of it. Especially with magnetic memory or spin torque transfer (STT) memory, you may end up having much better behavior because this material is more resilient to a little bit of heat. In fact, bits flip easier, so your write energy, which is the main problem of STT, might go down when the stack gets warm, and you have the advantage of non-volatility.”

Continued integration leads to new applications. “Designs are physically getting smaller and lower power while increasing processing capacity,” says Thompson. “That makes it possible to build devices with a significant amount of intelligence in places where they haven’t been able to fit before, such as a small form-factor PCIe card. In the past that would have required multiple high-end devices and it simply wouldn’t fit in a small form factor. I am starting to see a huge pull for many new compute and security applications that are challenged by memory and bandwidth bottlenecks, challenged by form factor, power and thermal constraints. Hyper adjacency of compute fabric, which can be processors or logic, massive amounts of memory available at high bandwidth, as well as very high-speed connectivity, are enabling solutions that had really hit dead ends with traditional solutions.”

So while HBM currently is a way to increase memory bandwidth for graphics processors, ultimately it may finish up being the gateway technology that allows the industry to make a controlled transition from 2D to 3D solutions.

Related
HBM Knowledge Center
Top stories, white papers, videos, blogs and more about HBM
Ensuring HBM Reliability
What can go wrong and how to find it.
Difficult Memory Choices In AI Systems
Tradeoffs revolve around power, performance, area, and bandwidth.
HBM2E: The E Stands For Evolutionary
The new version of the high bandwidth memory standard promises greater speeds and feeds and that’s about it.

Brian Bailey
Brian Bailey is Technology Editor/EDA for Semiconductor Engineering.