HBM2E: The E Stands for Evolutionary

The new version of the high bandwidth memory standard promises greater speeds and feeds, and that’s about it.


In March, Samsung introduced the first memory products that conform to JEDEC’s HBM2E specification, but so far nothing has come to market—a reflection of just how difficult this memory is to manufacture in volume.

Samsung’s new HBM2E (sold under the Flashbolt brand name, versus the older Aquabolt and Flarebolt brands) offers 33% better performance than HBM2, thanks to per-pin DRAM transfer speeds that reach 3.2 Gbps, and doubles the per-die density to 16 gigabits. These devices are based on eight 16-Gbit memory dies interconnected using through-silicon vias (TSVs) in an 8-Hi stack configuration. That means a single package is capable of 410 GB/s of memory bandwidth and 16GB of capacity.
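The quoted figures are internally consistent, and a quick back-of-envelope sketch ties them together (the 1,024-bit bus width comes from the interface description later in the article):

```python
# Sanity-check the Flashbolt numbers quoted above (a back-of-envelope sketch).

BUS_WIDTH_BITS = 1024    # HBM interface width per stack
PIN_SPEED_GBPS = 3.2     # per-pin transfer rate
DIES_PER_STACK = 8       # 8-Hi stack
DIE_DENSITY_GBIT = 16    # 16-Gbit dies

# Bandwidth: 1,024 pins x 3.2 Gbit/s per pin, converted to gigabytes/second.
bandwidth_gbps = BUS_WIDTH_BITS * PIN_SPEED_GBPS / 8
print(bandwidth_gbps)    # 409.6 GB/s -- the ~410 GB/s quoted

# Capacity: eight 16-Gbit dies = 128 Gbit = 16 GB per package.
capacity_gb = DIES_PER_STACK * DIE_DENSITY_GBIT / 8
print(capacity_gb)       # 16.0 GB
```

The 33% bandwidth jump over HBM2 falls out of the same arithmetic: 3.2 Gbps per pin versus HBM2’s 2.4 Gbps.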

Samsung used TSVs so that, instead of being connected with wirebonds routed along their sides, the DRAM chips could be stacked directly, with each chip having holes cut through it for connection to the chip below. Routing signals through thousands of TSVs required some creative electrical engineering to avoid signal loss, so multiple TSVs are used for each data bit.

Wirebonded memory was getting to the point where the length of the wires was causing inductance issues. With HBM2, Samsung increased memory frequency to 1.2 GHz and decreased the voltage to 1.2V at the same time. The company says that to lower the clock skew it had to decrease the deviation of data transfer speeds among the TSVs, and it increased the number of thermal bumps between the DRAM dies to distribute heat more evenly across each known good stacked die (KGSD).

Samsung has not disclosed how it achieved this 33% jump in performance, but it is likely an evolution of the processes above.

“When we built HBM2, we wanted to expand the market breadth the device could attack, but also add in two dimensions—capacity and more bandwidth,” said Joe Macri, corporate vice president and chief tech officer of the compute and graphics division at AMD. AMD is a major partner with Samsung in the development of HBM. “It’s still 1,024 bits wide, but doubled the frequency to two gigachannels and added Error Correction Code (ECC) to get into data center and AI and machine language, since the entire data center market is built on a trusted data model.”

With HBM2E, AMD, one of the co-developers of HBM, is turning the same levers again. “The only bits added to the interface were to increase addressability, but it’s the same interface, it just runs at a higher interface of 3.2 gigatransfers per second,” Macri said.

With the 1,024-bit data bus, HBM2E runs very wide, but not very fast. Two gigabits per second per pin is DDR3-class speed, notes Frank Ferro, senior director of product management at Rambus. “By going wide and slow you keep the power and design complexity down on the ASIC side. Wide and slow means you don’t have to worry about signal integrity. They stack the DRAM in a 3D configuration, so it has a very small footprint,” he said.
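The wide-and-slow tradeoff is easy to see in numbers. The sketch below compares one HBM2E stack with a typical GDDR6 device; the GDDR6 figures (32-bit interface, roughly 14 Gbps per pin) are common shipping-part values, not numbers from this article:

```python
# Compare "wide and slow" (HBM2E) with "narrow and fast" (GDDR6).
# GDDR6 figures are typical of shipping parts, not taken from the article.

def bandwidth_gb_s(bus_bits: int, pin_gbps: float) -> float:
    """Peak bandwidth in GB/s for a bus of the given width and per-pin rate."""
    return bus_bits * pin_gbps / 8

hbm2e_stack = bandwidth_gb_s(1024, 3.2)   # one stack: 409.6 GB/s
gddr6_chip = bandwidth_gb_s(32, 14.0)     # one device: 56.0 GB/s

# GDDR6 devices needed to roughly match a single HBM2E stack:
print(hbm2e_stack / gddr6_chip)           # ~7.3 devices
```

Each of those GDDR6 pins toggles more than four times as fast, which is where the signal-integrity and ASIC-side design effort Ferro mentions comes from.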

An HBM stack of four DRAM dies has two 128‑bit channels per die, for a total of eight channels and a bus width of 1,024 bits.

From 2 to 2E
The change from HBM2 to 2E is not revolutionary. It is pretty much a speeds and feeds update, but that’s more than enough for now, said Samsung.

“The key motivation in market outreach with HBM2E is higher capacity, and HBM3 will have even higher bandwidth and more capacity,” said Tien Shiah, senior manager for memory at Samsung. “HBM2 is limited in scope to the memory of the co-processor, primarily for AI and machine learning applications. But the forthcoming higher capacities will allow system architects to consider HBM in a much greater number of applications for more powerful next-generation artificial intelligence, machine learning, and Exaflop/Post Exaflop supercomputing.”

Hugh Durdan, vice president of strategy and products at eSilicon, agrees. “It’s more evolutionary than revolutionary,” he said. “I see extensions and enhancement of existing types of designs. HBM2E is an extension of HBM2. It’s faster but more important, adds address bits that lets you build a memory set four times as large and increase capacity of what you can put next to your SoC.”

Samsung is positioning HBM2E for the next-gen datacenter running HPC, AI/ML, and graphics workloads. By using four HBM2E stacks with a processor that has a 4096-bit memory interface, such as a GPU or FPGA, developers can get 64 GB of memory with a 1.64 TB/s peak bandwidth—something especially needed in analytics, AI, and ML.
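The aggregate figures follow directly from the per-stack numbers; a quick sketch of the arithmetic:

```python
# Aggregate figures for four HBM2E stacks on a 4,096-bit interface,
# as described for a next-gen GPU or FPGA above.

STACKS = 4
PER_STACK_BW_GBPS = 410   # GB/s per stack (the article's rounded figure)
PER_STACK_CAP_GB = 16     # GB per 8-Hi stack

total_bus_bits = STACKS * 1024                       # 4096-bit interface
total_capacity = STACKS * PER_STACK_CAP_GB           # 64 GB
total_bandwidth = STACKS * PER_STACK_BW_GBPS / 1000  # 1.64 TB/s

print(total_bus_bits, total_capacity, total_bandwidth)  # 4096 64 1.64
```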

Although vendors can stack HBM up to 12-Hi, AMD’s Macri believes all vendors will keep to 8-Hi stacks. “There are capacitive and resistance limits as you go up the stack. There’s a point you hit where, in order to keep the frequency high, you add another stack of vias. That creates extra area in the design. We’re trying to keep density and cost in balance,” he said.

Boosting AI
Existing apps will see a benefit from HBM2E just because of faster speeds and feeds. “A 50% capacity increase allows you to add so much more to the working set. So things you couldn’t cut up to fit now fit. As we scale our capability, the memory is scaling with us,” said AMD’s Macri.

At the same time, HBM2E opens the door to artificial intelligence in general, and machine learning in particular, because it is massively data intensive and requires processing of terabytes of data to train the machine. This is where HBM2E is expected to shine.

“One of the challenges in AI (from edge to cloud) is getting sufficient memory near the compute to ensure highest performance,” said Patrick Dorsey, vice president of product marketing in the Programmable Solutions Group at Intel. “As AI network models continue to grow in complexity and performance needs, the inherent flexibility of FPGAs to scale in compute capacity with a higher bandwidth HBM roadmap enables FPGAs to address new algorithms that were not possible before.”

High-performance compute and AI applications often require high-performance data compression and decompression, and HBM-based FPGAs can more efficiently compress and accelerate larger data movements. Dorsey sees FPGA with HBM enabling AI, data analytics, packet inspection, search acceleration, 8K video processing, high performance computing, and security.

Other uses
HBM primarily has been a GPU play, but Intel said its Stratix 10 FPGAs use HBM2, and its newly announced Agilex FPGAs will support next-generation HBM integration.

Intel isn’t alone. There has been a flood of AI chips in the works or on the market from a variety of companies, including Microsoft, Google and dozens of startups, and they all are looking at HBM, said Shiah.

“We see two architectures emerging. One is in-memory processing, to address the compute memory problems. The other is very near-memory processing. Virtually every AI chip company that we’ve come across is either looking at HBM or going to HBM. There might be some startups trying to do something different with SRAM, but every major AI chip company that is either in production or exploring designs is using HBM,” he said.

Shiah noted that many AI applications are memory bandwidth constrained, as opposed to compute constrained. Companies have used the roofline model to determine this is the case. “The current industry limitation is the speed and capacity of high speed memory. HBM2E and HBM3 will address the memory bandwidth problem with much faster and higher capacity memory,” he said.
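The roofline analysis Shiah refers to can be sketched in a few lines: a workload whose arithmetic intensity (FLOPs per byte of memory traffic) sits below the machine’s compute-to-bandwidth ratio is memory-bound, so faster memory raises its ceiling directly. The 100-TFLOPs accelerator and 4 FLOPs/byte kernel below are illustrative assumptions, not figures from the article:

```python
def roofline_gflops(peak_gflops: float, mem_bw_gb_s: float,
                    arith_intensity: float) -> float:
    """Attainable performance under the roofline model:
    min(compute ceiling, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gb_s * arith_intensity)

# Hypothetical accelerator: 100 TFLOPs peak, running a bandwidth-hungry
# AI kernel that does 4 FLOPs per byte of memory traffic.
PEAK = 100_000.0     # GFLOPs
INTENSITY = 4.0      # FLOPs/byte

with_hbm2 = roofline_gflops(PEAK, 307.2, INTENSITY)    # HBM2: 2.4 Gbps/pin
with_hbm2e = roofline_gflops(PEAK, 409.6, INTENSITY)   # HBM2E: 3.2 Gbps/pin

print(with_hbm2, with_hbm2e)  # 1228.8 1638.4 -- memory-bound either way,
                              # so ~33% faster memory buys ~33% more throughput
```

Because the kernel never reaches the compute ceiling, every bit of extra memory bandwidth converts directly into performance, which is exactly the case Shiah describes.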

Another new area where HBM2E is expected to make inroads is network packet switching, which needs the bandwidth to pump through the exabytes of network traffic the Internet handles every day.

eSilicon’s Durdan predicts that network switch chips will see line rates continue to increase, from the current 28Gbps up to 56Gbps, and they eventually will go up to 112Gbps. HBM2E and HBM3 will be required to keep up. “Every time you double the line rate you process twice as much data, so you need memory capacity to keep up with data to be processed. [HBM2E] will enable higher capacity chips. The processors get faster, and the links between processors and software will get faster. Memory ends up being a key piece of the equation,” he said.
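Durdan’s doubling argument is easy to quantify. The sketch below uses a hypothetical 32-port switch (the port count is an assumption for illustration, not from the article) to show how aggregate traffic scales with line rate:

```python
# Each line-rate doubling doubles the data a switch must buffer and process.
# The 32-port configuration is a hypothetical example.

PORTS = 32
aggregate = {rate: PORTS * rate / 8 for rate in (28, 56, 112)}  # GB/s

for rate_gbps, gb_s in aggregate.items():
    print(rate_gbps, gb_s)  # 28 -> 112.0, 56 -> 224.0, 112 -> 448.0 GB/s
```

At 112 Gbps line rates, even this modest hypothetical switch is moving more data per second than a single HBM2E stack can deliver, which is why Durdan expects HBM2E and HBM3 to be required to keep up.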

Cost remains an issue
Making 2.5D packages is not a cheap process, and HBM DRAM prices have been high due to limited availability. Both have limited the use of HBM to high-end networking and graphics devices.

Jim Handy, principal analyst with Objective Analysis, estimates the cost of a DRAM wafer at $1,600. HBM2 adds another $500 to that cost, a roughly 30% premium. “DRAM makers would charge a lot more than 30% for it,” he said. “I expect [HBM] will remain in the GPU space mostly, because it’s expensive. If the market gets big enough, then production will get cheaper and open the doors to new apps. I wouldn’t be surprised if in 10 years the majority of apps use HBM memory, but I also wouldn’t be surprised if they don’t.”

HBM is not widely used in GPUs, either. Despite its broad product line, Nvidia uses HBM2 in only four of its cards—Tesla P100, Tesla V100, TITAN V, and Quadro GV100. AMD uses it in its Radeon VII and MI lines.

The price has remained high because adoption has not been widespread enough to drive a new cost structure or increased supply. Samsung is the only supplier of HBM2E DRAM for now.

Hynix has an HBM2 product, but not 2E, and it has not said when a 2E part will come to market. A Micron spokesperson noted that Micron is developing HBM2E as well, and is engaging industry enablers to understand their needs for implementation of the technology.

(Micron has since abandoned the rival Hybrid Memory Cube architecture, which never achieved JEDEC standard status like HBM. The HMC domain, http://www.hybridmemorycube.org/, has since expired.)

On the horizon: HBM3
JEDEC is not standing still. HBM3 was announced by Samsung and Hynix at the 2016 Hot Chips conference with the usual changes—increased memory capacity, greater bandwidth, lower voltage, and lower costs. Bandwidth is expected to be 512 GB/s or greater. The memory standard is expected to be released next year.

Then there are HBM3+ and HBM4, reportedly to be released between 2022 and 2024, with more stacking and higher capacity. HBM3+ is supposed to offer 4 TB/s throughput and 1024 GB addressable memory per socket.

As it stands now, details on HBM3 are sparse. “It’s still a few years away, the standard is not defined yet. People are thinking of what they might like, but it’s too far out to think about it,” said Durdan.

Shiah said that in the past, HPC/AI was a major driver for the HBM roadmap, so speeds and feeds were the priority. But that will change. “Newer applications, however, are likely to require other attributes, such as extending the operating temperature, which we will be considering in our HBM2E and HBM3 designs,” he said.



Jeddidieya says:

“but doubled the frequency to two gigachannels”

Did you mean two gigatransfers? And really, let’s mention that the memory is DDR (Double Data Rate), so the actual clock rate is half the transfer rate.

And let’s be sure to mention that the JEDEC standard for HBM2 is what most are using currently, and that standard does support 8 128-bit channels per stack, with an additional pseudo channel mode that can divide each 128-bit channel into 2 64-bit pseudo channels. I’d really like some deeper dive into just what workloads may benefit from that 64-bit pseudo channel mode on HBM2.

And I’ll assume that HBM2E has all of HBM2’s feature sets and just a bit more speed/other features to rate that E being tacked on. Higher capacity is welcome for many different workloads, where having more of the data closer to where it’s needed saves a great deal on energy usage.

AMD’s GPUs since Fiji (the Radeon R9 Fury X at 28nm) have made use of HBM, with AMD switching over to HBM2 for Vega 10 at 14nm and on to Vega 20 at 7nm. But yes, HBM2 is a bit too expensive for the consumer/gaming market currently, and I’m sure that AMD was hoping that its and SK Hynix’s work on creating HBM would have reached an economy of scale quicker and replaced GDDR5/GDDR6 earlier. That has not happened yet, but it will given time, and the exascale computing and HPC markets should help.

What does not help currently is the relative lack of HBM2 competitors in the marketplace, and that’s been the major factor in HBM2 prices remaining on the higher side.

Now, as far as the consumer gaming market is concerned, if a single stack of HBM2E memory can support 410 GB/s of memory bandwidth, then that’s in the ballpark of what 2 HBM2 stacks provided at the lower clocks/lower bandwidth of the earlier HBM2 standard, so that’s good news for single-stack HBM2E options in the consumer/gaming GPU market.

So maybe Samsung/JEDEC/other HBM makers are trying to make HBM2E a more affordable option compared to GDDR6, where GPU makers can get sufficient bandwidth to feed a consumer/gaming GPU SKU with only one HBM2E stack required.

The current MacBook Pro laptops make use of one of the only few AMD discrete mobile Vega GPU SKUs that actually offer HBM2, and HBM2E appears to be ready-made for the discrete mobile GPU market (the mainstream gaming GPU market as well), where only a single HBM2/HBM2E stack would be needed to provide more than enough bandwidth for a beefier discrete mobile GPU variant.

A single stack of HBM2E to replace 2 stacks of HBM2 would come in at closer to half the cost and still be able to provide sufficient bandwidth across a 1,024-bit connection (8 independently operating 128-bit channels). There are not many GDDR6 configurations with a connection that even reaches 512 bits, with most GDDR6 configurations in the 256- to 384-bit width range.

So HBM2E will definitely be an option to supplant GDDR6 on more GPUs than just the commercial GPU SKUs, where the higher costs of HBM2 can be justified for the performance gained. The consumer GPU market is much more cost conscious, but a single HBM2E stack looks to be more cost competitive with an 8-die GDDR6 arrangement, and HBM2E’s single-stack bandwidth is closer to the bandwidth that most GDDR6 (8-die) based GPU configurations offer.

SK Hynix is stating 460 GB/s of bandwidth for its HBM2E, so that’s very close to the 483.8 GB/s rated bandwidth of AMD’s Vega 64 with its 2-stack HBM2 configuration. And that single stack of Hynix HBM2E exceeds Vega 56’s 409.6 GB/s of bandwidth across 2 lower-clocked HBM2 stacks. One only needs enough bandwidth to support a GPU’s total number of shader cores at their peak bandwidth usage, and most workloads are not that demanding, so a single stack of HBM2E should be a more attractive solution in the mainstream consumer gaming/GPU market.

evolucion8 says:

HBM2 is also used on the Vega 56, Vega 64, and the MI cards that are based on these GPUs (Vega 10).

(So not just Radeon VII and its MI derivatives, aka Vega 20.)
