HBM Upstages DDR In Bandwidth, Power

Design challenges and tool flow gaps emerge, but so do real-world PPA metrics.

popularity

For graphics, networking, and high performance computing, the latest iteration of high-bandwidth memory (HBM) continues to rise up as a viable contender against conventional DDR, GDDR designs, and other advanced memory architectures such as the Hybrid Memory Cube.

HBM enables lower power consumption per I/O and higher bandwidth memory access with a more condensed form factor, which is enabled by the stacking memory dies directly on top of each other and sharing the same package as an SoC or GPU core using a silicon interposer. Each pin drives a much shorter trace, and runs at a lower frequency, which in turn results in significantly lower switching power.

The high bandwidth performance gains are achieved by a very wide I/O parallel interface. HBM1 can deliver 128GB/s, while HBM2 offers 256GB/s maximum bandwidth. Memory capacity is easily scaled by adding more dies to the stack, or adding more stacks to the system-in-package.

Screen Shot 2017-03-21 at 7.59.14 PM
Fig. 1: HBM1 vs. HBM2. Source: SK Hynix

Companies working with HBM are just starting to report real-world performance numbers. eSilicon has found an 8X bandwidth increase with HBM over other DDR approaches. And Synopsys has found that comparing the relative energy per bit just in the physical interfaces of DDR4 and HBM2 there is a 5X reduction in energy per bit just in the data transfer part.

That has prompted a spike in design activity around HBM. According to Frank Ferro, a senior director of product management at Rambus, what’s driving this is the ability to take existing DRAM with 2.5D technology and move it closer to the processor using a fatter data pipe, speeding up data throughput, reducing the amount of power necessary to drive a signal, and cutting RC delay.

“Originally, high-bandwidth memory was seen by the graphics companies as a clear step in the evolutionary direction,” Ferro said. “But then the networking and data center community realized HBM could add a new tier of memory in their memory hierarchy for more bandwidth, and all the things that are driving the datacenter: lower latency access, faster access, less delay, lower power. As a result, HBM design activities have picked up pace in these market segments.”

HBM design challenges
Along with the benefits come the challenges, however, and there are quite a few introduced in HBM designs, pointed out Calvin Chow, senior area technical manager at Ansys. “One of these includes accounting for the power noise impact from HBM I/Os. Even though the power per pin is lower, there are many more I/Os firing in parallel, resulting in significant increase in current consumption. Though the signal traces are shorter, there is still a noise concern due to the simultaneous switching of a large number of I/Os.”

Given that there are more than a thousand PHY I/Os in an HBM design, testing/debugging an HBM design is more challenging than traditional DDR based designs because it’s difficult to probe the microbumps of the thousands of PHY I/Os, he explained. “Couple that with the fact that there are multiple parts coming from potentially different sources—memory and controller dies, the SoC die, silicon interposer, and package—and you could have a very challenging time debugging a failure or performance issue. It’s extremely important to accurately model each component in a holistic simulation during the design process.”

Also, with the large current draw in a significantly smaller form factor, thermal implications are another key focus area when designing with HBMs.

“In fact, the thermal characteristics of the HBM package are extremely important such that we had to do multiple thermal test vehicles to make sure that what we were going to be recommending would not cause the HBM stack itself to start failing,” explained Deepak Sabharwal, general manager of eSilicon’s IP products and services. “The power integrity is also very important. You have to make sure that you’ve got proper meshing going on inside the die and connections to all these microbumps so you do not have high resistance because you’re concentrating too much current in one area versus another.”

Chow agreed. “The stacked memory dies tend to superimpose heat as it creates an obstructed heat flow path. Not only can heat impact performance, but thermal integrity must be verified to ensure wires and vias meet the desired lifetime requirements. One should also be aware of how heat can couple from SoC to memories, and vice versa. Package and silicon interposer layers may also be vulnerable to thermal-induced mechanical stress. Thus, understanding the thermal effects of the HBM design is critical in determining the cooling needed at the system level.”

Screen Shot 2017-03-21 at 8.02.03 PM
Fig. 2. HBM connected to interposer. Source: Samsung.

Power/performance tradeoffs
From designer’s perspective, designing HBM is different than what they’ve done in the past for the stacked structure, said Marc Greenberg, director of product marketing for DDR controller IP at Synopsys. For one thing, HBM is a multi-channel architecture. On top of that, there is a question about how to utilize all of the bandwidth in HBM.

Engineering teams still face the traditional power/performance tradeoffs that they always have, but HBM helps the situation on an energ-per-bit basis, Greenberg said. “HBM is quite efficient on the energy used to transfer each bit of data between the DRAM and the SoC for a few reasons. The nature of the interface is structured such that the connections are really short, are in a very good channel, and have very low capacitance associated with them, so you really just push a very small signal a very short distance. Unlike DDR4, where the signal has to go from the SoC, through a package, into a PCB, across a PCB, into a DIMM socket, turn a corner to get into the DIMM socket, go up onto a DIMM, fan out across a DIMM, and then go into the DRAM— there’s a lot of impedance and capacitance associated with that path. There’s less of it in HBM because the paths are much shorter, much better controlled.”

Further, fewer signal conditioning techniques need to be used in the HBM interface than in the DDR interface. These techniques take power in the DDR case, and they are just not necessary in the HBM case, he said.

When it comes to making the tradeoffs, Lou Ternullo, product marketing group director for memory, storage and interface IP at Cadence reminded that there are typically two knobs that can be turned to increase raw bandwidth. One is the clock frequency — the higher the datarate, the higher the bandwidth. The other is the number of data pins that can be accessed simultaneously. For example, if the system is 32 bits, at the same frequency if you double the number of bits to 64, the bandwidth is effectively doubled; so HBM is actually 1,024 bits — which is significantly wider than any other DRAM in the market today.

So how can this be done, and still reduce power? He asserted the power aspect is relative, and it has to be looked at in a relative context otherwise it’s not apples to apples. “For example, if there are two different types of memories — A and B, for discussion — and they are run at the same clock frequency, for the same bandwidth you’re hoping to get lower power — that’s what you’re trying to achieve in this case. The way they did it, is by kind of playing with the laws of physics to get that to work. By designing the whole ecosystem around the microbumps and the silicon interposer, and so forth, the technology allows transmission at a fairly high datarate without requiring an excessive amount of termination, and that reduces the power, hence, the bandwidth gains are achieved. That’s the laws of physics at play.”

Of course all of this complexity adds to the BOM, Ternullo noted. “The silicon interposer, which is in itself a new technology, adds more complexity, and the bill of materials increases. You need to then look for applications that are high bandwidth, with a reasonable amount of margin in the product to support a higher cost system. The system will be more costly compared to just buying regular DRAM and putting it on a board. Essentially what we’re seeing is networking applications, high performance computing, those types of very high bandwidth demand, as well as high margin products. Taking the cost factor out of it, and just looking at the bandwidth and the problems that could solve, there could theoretically be other applications — it’s a matter of the price point of the solution coming down enough to fit in other markets.”

Gaps in tool flows

Despite the advantages, however, design tools are lagging when it comes to HBM.

“There are tons of gaps today,” Sabharwal said. “What you are doing today on a PCB, where you put the ASIC, you put some other DRAM on a PCB and hook them all up. Now you are taking that level of integration and putting it inside one package. This is stuff that a traditional ASIC designer would not have to worry about. The chip would be created with the pin out. Now, it’s the designer’s job to connect the two chips, and do all the board design work. Everything that used to be on a PCB now the designer has to make sure the interposer is designed to run all those tracks. Today, there is no tool available to do this.”

There is no way to extract the interposer properly, so commercial tools are used to model inductive effects. There are new tools on the way, but design teams have gone back to using traditional field solvers to get the true R&C characteristics of the interposer to attack the extraction challenge, Sabharwal said.

Another gap in the tool flow is the lack of ability to run LVS (layout versus schematic). As Sabharwal explains, “Inside a chip, I can run LVS, and make sure that I’ve got my connectivity all figured out, and done correctly. But when I’m putting two chips together inside the package, how do I make sure that I’ve got that connectivity done correctly? There are routing signals between the two chips, and LVS can’t be run in that environment so engineers come up with their own homegrown scripts, create bump-out patterns from one chip, and for the other chip they get coordinates, then write scripts to make sure that the signals are tying off correctly.”

The bottom line is that solutions can’t come too soon.

Conclusion
Where high-bandwidth memory will be utilized is still unfolding. This is a new technology and its benefits are just beginning to be documented.

HBM is finding its way into leading-edge graphics, networking and high-performance computing today. But it also may find a role in deep learning, artificial intelligence, convolutional neural networking, fully autonomous vehicles, and other advanced applications that demand massive bandwidth and low power—particularly as the price of interposers continues to drop and chipmakers build up enough expertise and history working with this new memory type and different architectures and packaging.

Related Stories
Sorting Out Next-Gen Memory
A long list of new memory types is hitting the market, but which ones will be successful isn’t clear yet.
The Future Of Memory
Experts at the table, part 2: The impact of 2.5D and fan-outs on power and performance, and when this technology will go mainstream.
Convolutional Neural Networks Power Ahead
Adoption of this machine learning approach grows for image recognition; other applications require power and performance improvements.
What’s Next For DRAM?
As DRAM scaling runs out of steam, vendors begin looking at alternative packaging and new memory types and architectures.



  • socketus popetus

    The Radeon R9 Fury X (Fiji XT)
    released on 24 June 2015 is impressing me even more now. It is that proof of concept chip that kick started the whole HBM ecosystem.