Flash Getting Stacked High-Bandwidth Version

Inspired by HBM, HBF could improve AI efficiency with stacked 3D flash memory.

Key takeaways:

  • A new HBF 3D flash stack is similar to HBM for use in AI processing.
  • HBF capacity will be much higher, allowing static storage of AI model weights, with optimized read speed.
  • Samples are due out later this year, with accelerators featuring it coming out next year.

AI inference using modern models requires billions of parameters, and moving them to where they can be consumed requires time and energy. A new effort to standardize a high-bandwidth version of flash memory proposes to keep those parameters much closer: inside the package with the GPU.

Flash is known for high capacity and data that requires no refresh. But its performance has been no match for high-speed computing. To address this, Sandisk has proposed high-bandwidth flash (HBF), a 16-die-plus-base-die flash stack that fits the same footprint as HBM, although with a different interface protocol.

“HBF was announced by Sandisk in 2025 to utilize flash for high bandwidth, high-capacity memory, targeting AI inference applications,” said Xi-Wei Lin, executive director, applications engineering at Synopsys.

This could make it possible to store all parameters right next to the GPU, with no need for them to leave the package, as the read speed is optimized to ensure quick retrieval.

“HBF is gaining attention as system designers look for ways to improve data storage and access, for example, with new memory tiers that sit between DRAM and traditional NAND,” said Steven Woo, fellow and distinguished inventor at Rambus.

Working with SK Hynix, Sandisk has submitted the technology to the Open Compute Project (OCP) for standardization.

“Sandisk plans first samples of HBF in the second half of 2026 and expects the first inference devices using HBF to sample in early 2027,” Woo noted.

Bringing weights from afar
AI processing involves large amounts of data that can be roughly split into two camps. One set of data comes from inputs to the AI model, as well as the computational results of each processing layer. That data is dynamic, with external inputs coming in real time, so it needs a place to store and retrieve results as they appear. This data is often generically called activations.

The other set of data is the weights (or parameters) that represent the model. That data doesn’t change during an inference run, and so, in theory, it could be packed into a GPU or other processor and just left there. The problem is, it’s an enormous amount of data — more than could fit on a processing chip.

Companies working on in-memory compute (IMC, also called CIM) took advantage of the fact that vector multiplication could be done easily, given some tweaks to a nonvolatile memory array. The theory is that you can load the weights once and then run the model many times. The only data moving in that case are the activations.

But there’s only so much data one can store on a single chip. And IMC approaches haven’t been able to tackle today’s LLMs simply because they require too many parameters. This reinforces the fact that these models need more memory than will typically fit on a single chip, especially a processing one.

Storing them near the processor
So, where to store the weights? For giant models, the highest-capacity nonvolatile memory (NVM) would be an SSD plugged into the rack. From a processing standpoint, those weights are as far away as they can be within the confines of a rack. More challenging would be weights stored on some network-attached storage even farther away.

That gives a caching hierarchy involving three types of memory:

  • Nonvolatile memory is the long-term storage.
  • Weights needed for use are retrieved from storage and are cached in DRAM — possibly HBM.
  • Those weights are further cached in SRAM once brought to the processor.

This means that a weight requested for the first time will experience very high latency, as it goes from storage through DRAM to SRAM. Once cached, weights can be accessed more quickly, but not all can fit in the cache. If evicted and needed again later, they’ll have to traverse that whole path again.
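The first-access and eviction behavior described above can be illustrated with a toy model. This is a minimal sketch, not a description of any real memory system: the latency figures, cache sizes, and the `LRUCache` and `fetch` names are all hypothetical, chosen only to show that a cold or evicted weight pays for the full storage-to-DRAM-to-SRAM path.

```python
from collections import OrderedDict

# Illustrative latencies in nanoseconds. These are assumptions, not measurements.
LATENCY_NS = {"sram": 1, "dram": 100, "storage": 100_000}


class LRUCache:
    """A tiny least-recently-used cache holding weight-block IDs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def hit(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return True
        return False

    def fill(self, block_id):
        self.blocks[block_id] = True
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block


def fetch(block_id, sram, dram):
    """Return the modeled latency (ns) to bring one weight block to the processor."""
    if sram.hit(block_id):
        return LATENCY_NS["sram"]
    if dram.hit(block_id):
        sram.fill(block_id)
        return LATENCY_NS["dram"] + LATENCY_NS["sram"]
    # Miss in both caches: traverse storage -> DRAM -> SRAM.
    dram.fill(block_id)
    sram.fill(block_id)
    return LATENCY_NS["storage"] + LATENCY_NS["dram"] + LATENCY_NS["sram"]


sram = LRUCache(capacity=2)   # hypothetical sizes, in weight blocks
dram = LRUCache(capacity=4)

first = fetch("w0", sram, dram)    # cold: full path
second = fetch("w0", sram, dram)   # warm: SRAM hit
for other in ("w1", "w2", "w3", "w4"):
    fetch(other, sram, dram)       # enough traffic to evict w0 from both tiers
third = fetch("w0", sram, dram)    # evicted: full path again

print(first, second, third)
```

With these made-up numbers, the third access costs exactly as much as the first, which is the penalty that keeping weights inside the package is meant to reduce.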

The idea with HBF is similar to the one that inspired HBM: a stack of high-I/O NVM dies co-packaged with a processor. But it provides a greater benefit than HBM because an SSD typically sits farther from the processor than the standard DRAM DIMMs whose access HBM accelerates. And the weights can bypass DRAM entirely, although that might complicate the caching architecture.

Fig. 1. HBF’s role in AI. It enables the weights to be stored in the package rather than down the network somewhere. Source: Bryon Moyer/Semiconductor Engineering

“HBF is an architectural stage that bridges the gap between high-bandwidth access and high storage footprint,” said Sharad Chole, chief scientist and co-founder at Expedera. “By connecting these SSD stacks directly to the accelerator, AI workloads should be able to access memory in the same way they currently address DDR. This will allow for simpler designs in the future where storage is not limited to PCIe speeds and latencies.”

In this model, HBM would be employed only for variables generated during processing — activations — rather than pre-calculated ones.

“We’ll load the model weights from storage, then process the computation context in DRAM,” said Pax Wang, director for advanced packages at UMC.

Flash challenges
Flash’s big claim to fame is its capacity. “HBF is a NAND-based architecture targeting 8× to 16× the capacity of HBM at the same read bandwidth and price points of HBM,” Rambus’ Woo said.

That capacity far outstrips what’s available in HBM. Jongsin Yun, director of memory research at Siemens EDA, noted, “The latest HBM stack can hold up to 192 GB, and the next product is targeting around 400 GB. But with HBF, they’re already reaching 3 Tb.”

Flash’s Achilles’ heel has been speed, write speed in particular. When writing to flash, several effects conspire to slow things down:

  • Cells must be erased before being written in order to prevent overprogramming.
  • The block architecture that helps minimize die area means that an entire block must be erased if even one bit in that block is to be written.
  • The physics of the write process takes time.
  • All bits that were erased must be rewritten.
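The cost of the erase-before-write sequence listed above can be shown with a deliberately simplified model. This is a sketch under stated assumptions, not how a real flash controller works: the 8-bit block is hypothetical (real blocks are far larger), the `write_bit` helper is invented for illustration, and wear leveling, page granularity, and remapping are ignored.

```python
def write_bit(block, index, value):
    """Model changing one bit in a flash block; return how many bits get rewritten."""
    updated = list(block)         # copy the contents out before erasing
    updated[index] = value        # apply the one-bit change to the copy
    block[:] = [1] * len(block)   # erase the whole block (erased NAND cells read as 1)
    block[:] = updated            # rewrite every bit that was erased
    return len(block)


block = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical 8-bit block
rewritten = write_bit(block, 3, 0)
print(block, rewritten)
```

In this model, a one-bit logical change costs a full-block erase plus a full-block rewrite, so the penalty grows with block size.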

Although incremental improvements to performance may always be possible, there’s no way to get around the write-performance hit. For this reason, HBF can’t completely replace HBM since the system needs memory it can write to with reasonable speed, and that will never be flash (at least how it’s implemented today).

“Flash is designed to be as cheap as it can possibly be, and because of that, it does give up some speed,” noted Jim Handy, memory analyst at Objective Analysis. “Most of the speed that you lose in flash is in the write cycle. You’ve got to put up with this quantum mechanics stuff that slows it down. But for reads, it could be faster.”

Inference, not training
As a further result of the slow write speed, HBF is likely to be useful only for inference engines, which have static weights. “It’s being targeted at inference rather than training,” said Handy. “During training, you’re constantly changing the weights. During inference, you leave them alone.”

Sandisk has also optimized the read path to reduce weight-fetch latency while keeping each die in the stack monolithic. “High Bandwidth Flash is about more than new interfaces or form factors,” explained Cynthia Hsu, senior director of flash memory design at Sandisk. “It involves re-architecting the internal read path and leveraging multi-array parallelism more intelligently so data can be accessed and delivered with lower effective latency and more consistent bandwidth, with NAND, controller, and firmware co-designed as a system.”

As an additional challenge, flash cells can be written only a limited number of times before they wear out. It’s not yet clear what HBF endurance will look like. “Some products offer 10,000 writes, and some rare products offer 100,000, but they are in the level of thousands,” said Yun.
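To get a feel for what an endurance figure in that range means for inference, here is a rough back-of-the-envelope sketch. It assumes that each model update rewrites every cell once and that nothing else writes to the stack. The update rates are hypothetical, and the 10,000-cycle figure is the general flash number quoted above, not a published HBF specification.

```python
def years_until_worn_out(write_cycles, rewrites_per_day):
    """Years of service if each rewrite consumes one write cycle on every cell."""
    return write_cycles / rewrites_per_day / 365


daily = years_until_worn_out(10_000, 1)     # weights replaced once a day
hourly = years_until_worn_out(10_000, 24)   # weights replaced once an hour

print(f"daily updates:  {daily:.1f} years")
print(f"hourly updates: {hourly:.2f} years")
```

Under these assumptions, occasional weight updates leave decades of headroom, while constant rewriting of the kind training requires would use up the budget quickly.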

Other NVM technologies exist, such as MRAM and RRAM. But both are still finding their way. MRAM is further along, but it comes with a tradeoff: cells are typically optimized for speed or data retention, not both. So that means that the same cell couldn’t provide both storage and low-latency access. It could be that a middle ground, with sufficient latency, could be found, but flash is so much more mature that it’s the natural choice at this time.

RRAMs are even farther away despite years of development. One of the constant challenges for new NVM cells is that, out of the gate, with no manufacturing learning, they must be cheaper than the highly optimized incumbent, flash. Would a high-bandwidth version (HBR?) provide an opportunity to get volumes and learnings going? Possibly, but it’s hard to imagine it competing on price in the early days. And if it can’t win in the early days, then it won’t get the learning necessary to reduce the cost long-term.

Sandisk opted against these newer cell types. “We’ve looked broadly at a range of non‑volatile memory technologies over time,” said Hsu. “But High Bandwidth Flash is built around the strengths of NAND — density, scalability, and cost efficiency — using a technology foundation that’s proven and available today, reimagined to help deliver the bandwidth levels these systems require.”

Specs and schedule
HBF capacity is 256 Gb per die, which gives 512 GB per 16-high stack. Read bandwidth is 1.6 TB/s. It will match the footprint, power profile, and physical stack height of HBM4.
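A quick check of the stack arithmetic, taking the per-die capacity as 256 gigabits, which is the reading under which a 16-high stack comes to 512 GB. The sketch assumes decimal units (1 TB = 1,000 GB) and ideal sustained bandwidth. The one-byte-per-weight figure is a hypothetical chosen for illustration, not a number from Sandisk.

```python
DIE_CAPACITY_GBIT = 256     # per-die capacity, in gigabits
DIES_PER_STACK = 16
READ_BW_TB_PER_S = 1.6      # read bandwidth, in TB/s

# 8 bits per byte: convert the stack total from gigabits to gigabytes.
stack_gb = DIE_CAPACITY_GBIT * DIES_PER_STACK / 8

# Time to stream the entire stack once at the stated read bandwidth.
full_read_s = stack_gb / (READ_BW_TB_PER_S * 1000)

# Hypothetical: weights stored at 1 byte each (e.g., 8-bit quantization).
weights_billion = stack_gb / 1

print(f"stack capacity: {stack_gb:.0f} GB")
print(f"full-stack read: {full_read_s:.2f} s")
print(f"8-bit weights per stack: {weights_billion:.0f} billion")
```

Under those assumptions, one stack holds on the order of half a trillion 8-bit weights and can be read end to end in roughly a third of a second.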

Sandisk anticipates sample availability in the second half of this year, with systems appearing in 2027. The company also teased a roadmap for future improvements.

  Gen 1 Gen 2 Gen 3
Capacity 1.5×
Read bandwidth 1.45×
Energy consumption* 0.8× 0.64×

* Sandisk calls this “energy efficiency,” but lower future numbers shouldn’t be interpreted as lower efficiency.

Sandisk is largely known for NVM in a variety of form factors. It’s working with SK Hynix, which also makes NVM alongside a variety of DRAM flavors. Together, they’re sponsoring a standardization effort at the OCP.

“HBF is not a drop-in solution to replace HBM,” explained Lin. “It requires a different interface, so standardization is needed to make the architecture proliferate. Sandisk and SK hynix recently signed an MoU for that purpose. It may take some time for more vendors to join while development is underway.”

The choice of OCP is a little surprising since most memory standards, such as those defining the HBM families, come through JEDEC. Sandisk’s Hsu explained that, “For the closed working-group phase of this standardization effort, we chose OCP because the workstreams are highly goal-oriented, which allows us to iterate on the specification in real-time and match the rapid innovation cycles of the AI market.”

An important aspect of HBM4 is the possibility of doing a custom base die. Asked whether Sandisk anticipated something similar for HBF, Hsu replied that, “Standards play a key role in accelerating technology adoption by helping ensure compatibility, scalability, and a clear path for ecosystem integration. For a new technology like High Bandwidth Flash, broad, industry-wide collaboration on a common specification is critical to drive adoption and alignment across the hardware, software, and system ecosystem, and that’s what we’re focused on right now.”

So that means not now, but not never.

Taking a weight off
HBF would seem to have a very specific home and purpose, and it might seem like a lot of work to do for just one application. It’s worth remembering that HBM was in a similar situation, and it’s proved its worth. HBF may benefit a narrower class of workloads, but it’s a huge class that we’re going to see only more of. Those AI weights aren’t going anywhere.

Ultimately, it provides new architectural options. “We expect HBF to impact how AI accelerators are designed for data centers,” said Chole. “It opens a dimension that architects did not previously have access to.”

