The number of options and tradeoffs is exploding as multiple flavors of DRAM are combined in a single design.
Chipmakers are beginning to incorporate multiple types and flavors of DRAM in the same advanced package, setting the stage for increasingly distributed memory but significantly more complex designs.
Despite years of predictions that DRAM would be replaced by other types of memory, it remains an essential component in nearly all computing. Rather than fading away, its footprint is increasing, and so is the number of options.
There are several factors driving this expansion. Among them:
For all these reasons and others, chipmakers are using more DRAM. In some cases, DRAM — particularly high-bandwidth memory (HBM) — is replacing some SRAM. DRAM has a proven track record of endurance, as well as mature processes, and it is much cheaper than SRAM. In raw numbers, SRAM may cost upwards of 2,500 times more than DRAM for the same capacity, depending on type of DRAM, according to Jim Handy, general director of Objective Analysis.[1]
There is a spectrum of DRAM available, of course. Some is very fast, such as HBM, but also expensive. Other types are slower but cheap, such as basic DDR DIMMs. What’s changed, though, is that in a heterogeneous architecture both can play important roles, along with multiple other DRAM flavors and more narrowly targeted memories such as MRAM or ReRAM.
“We’re looking at more of a mixed model, using different DRAM technologies in the same system,” said Kos Gitchev, senior technical marketing manager at Cadence. “If you really need very high performance, and you’re willing to pay for it, then you’ll probably go for HBM. You can use that for L3 cache, or whatever else you need to access immediately. If you still need more memory, but with a little more latency, you can use DRAM in an RDIMM (registered dual in-line memory module) or an MRDIMM (multiplexer-ranked DIMM). And if you’re looking for large capacity, then maybe you’re looking at DRAM behind CXL. That technology is starting to target very specific applications with really high bandwidth and low power, larger memory footprint, but with a little more latency. Mixing all of those together is the direction everybody is going to solve those problems.”
As with nearly every improvement at advanced nodes or in advanced packaging, solving one issue can lead to another. Still, the underlying theory is sound, and there are proof points in the market today. For example, it may be essential to keep some features running at maximum speed, such as AI, which would make high-bandwidth memory the optimum choice. But not all features are essential, and they don’t all require that level of performance. In some cases, GDDR5 or GDDR6 may be sufficient. In others, it may be LPDDR, and in others maybe DDR4. There are different costs associated with all of these, and those costs can be measured in resources to move data back and forth, as well as the monetary value of the memory chips.
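The mixed model Gitchev describes can be sketched as a simple selection rule: for each feature, choose the cheapest DRAM tier that still meets its capacity and latency needs. The sketch below is illustrative only. The tier names come from the discussion above, but the class, the function, and every number (relative latency, relative cost, capacity ceilings) are hypothetical placeholders, not vendor data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryTier:
    name: str
    relative_latency: float      # hypothetical, unitless (1.0 = fastest tier)
    relative_cost_per_gb: float  # hypothetical, unitless
    max_capacity_gb: int         # hypothetical capacity ceiling


# Placeholder values chosen only to show the shape of the tradeoff.
TIERS = [
    MemoryTier("HBM", 1.0, 10.0, 64),
    MemoryTier("RDIMM/MRDIMM", 2.0, 3.0, 1024),
    MemoryTier("DRAM behind CXL", 4.0, 2.0, 8192),
]


def pick_tier(capacity_gb: int, max_relative_latency: float) -> Optional[MemoryTier]:
    """Return the cheapest tier that satisfies both constraints, or None."""
    candidates = [
        t for t in TIERS
        if t.max_capacity_gb >= capacity_gb
        and t.relative_latency <= max_relative_latency
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.relative_cost_per_gb)


# A latency-critical feature lands in HBM; a large, tolerant one lands behind CXL.
print(pick_tier(32, 1.0))
print(pick_tier(4096, 5.0))
```

A real design would weigh many more factors, including power, signal integrity, and thermal limits, but the structure of the decision is the same: constraints first, then cost.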
The flip side is that not all DRAM is created equal, and just adding different flavors of DRAM without fully understanding how they will affect other components can cause problems. It’s important to integrate them in a way that avoids future issues, and that includes sophisticated floor planning to avoid signal integrity problems and prevent thermal issues. It’s well known that DRAM and heat do not go together well. But there also are a number of new concerns that were never seriously considered before.
“The big issues for DRAMs moving forward break down into two categories — the usual suspects (more bandwidth and capacity, managing power), and some new ones (more challenging reliability, which are causing things like on-die ECC and RowHammer protection),” said Steven Woo, fellow and distinguished inventor at Rambus. “For the new challenges, putting more capacitors on a chip is increasing the occurrence of on-die errors, so you see DRAMs today that do some amount of on-die error correction before data is returned to the controller. And neighbor cell-disturb issues like RowHammer occur because the cells are in such close proximity to each other that accessing one set of cells can cause close neighboring cells to have their bits flip.”
Fig. 1: Memory interface chip on a DIMM. Source: Rambus
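To make the on-die error correction Woo mentions more concrete, the following is a minimal sketch of single-error correction using a textbook Hamming(7,4) code. It is an assumption for illustration only. Production DRAMs protect much wider data words with wider codes, and the function names here are hypothetical. The principle is the same, though: stored parity bits let the device locate and flip back a single bad bit before data is returned to the controller.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword positions (1-indexed): p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Correct up to one flipped bit.

    Returns (data bits, error position), where position 0 means no error.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome equals the 1-indexed bad bit
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


stored = encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single cell upset
recovered, where = decode(stored)
print(recovered, where)
```

A code like this corrects one flipped bit per word. Multi-bit upsets, such as those a RowHammer-style disturbance can cause, need additional protection.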
What works best where
The growing number of options also makes it hard to decide which memories to use. DRAM typically is chosen based upon performance, power, cost, reliability (including error correction code support, full testing, and a secure supply chain), and capacity. So if DRAM is going to be used for L3 cache, it likely will require high performance and low power. If it’s going to be used for a low-level feature in an advanced package, it may be a standard DIMM.
But each of those choices also affects the overall chip or system-in-package design, and comes with specific design considerations.
“In the past, DDR4 and LPDDR4 were not over-the-top complicated,” said Graham Allan, senior manager for product marketing at Synopsys. “One customer would enable DDR4, and another would enable LPDDR4, and there was overlap. As we’ve gotten past those generations into DDR5 and LPDDR5 and beyond, those application spaces have really diverged, and so have the interface protocols and physical signaling. DDR5 typically wants to talk to tons and tons of DRAMs — large capacity — so you’re mostly interfacing to register DIMMs. With LPDDR, you’re typically talking to one package or device, and you have a maximum of two loads in that device. LPDDR is also ground-terminated. DDR is terminated to the positive voltage rails. Those are very different physical interfaces and protocols, and that means customers need to choose one or another.”
There also are some in-between options that can help utilize the same design across multiple applications. MRDIMMs, for example, either can be used to double the capacity or double the bandwidth, depending upon the workload. “Multiplexer-ranked DIMMs allow twice the capacity and a speed of up to twice that of the SDRAM (synchronous DRAM),” said Allan. “The beauty of it is the DRAMs don’t change. It operates in two different modes. It operates like a load-reduced DIMM where it doesn’t double the speed. That would be a mode where you are using it for higher capacity. Or it operates in the multiplex-ranked mode, which doubles the bandwidth between DRAMs and the external interfaces.”
That’s part of the picture. The other part is the PHY, or physical layer, which provides a physical interface to the memory. PHYs vary by the type of DRAM used, and they have become particularly important as the amount of data increases and as designs become increasingly heterogeneous.
PHYs also can be linked together into a kind of master stack in order to manage memory resources in a complex device, regardless of whether that memory is GDDR6 or LPDDR4. That way all types of DRAM can be viewed as available resources and managed centrally.
“With some type of fabric where you manage the bandwidth, everything is visible and addressable,” said Balaji Kanigicherla, corporate vice president and general manager of Renesas’ Infrastructure Business Division. “It’s not just about improving the density, or the physics of the memory, which is material science. The application architecture of the memory is where the industry needs to go. The density needs to improve, because you want more capacity at the same bandwidth. We can mix and match based upon the path per dollar or per gigabyte, and we can use tiering between the SSD, the DRAM, and the local on-chip SRAM caches. This is shifting to a TCO for the entire system, and looking at the cost we’ll be paying for each tier.”
This essentially raises the abstraction level for memory management. “You can evolve from the current model to address memory at a global level, and basically create enough efficient interconnects to manage caching or reduce latencies,” said Kanigicherla. “It’s like a partition of global addressable memory. It’s evident that you need to provide the bandwidth. But the good news is that with AI workloads, they are a little less sensitive on latency and more sensitive to the bandwidth. So you can take this technology to scale up. Between CXL and UCIe, there should be a more gradual way to disaggregate the memory, maybe include optical interconnects, and enable a full global view of the memory. But it takes the whole industry to get there. It’s not straightforward.”
Fig. 2: Centralized control of system DRAM. Source: Renesas
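One way to picture a globally addressable memory built from different DRAM types is a flat address map that resolves any global address to a device and a local offset. The sketch below assumes the simplest possible layout, with regions packed back to back in a single address space. The class names, device names, and sizes are hypothetical, and a real fabric would add interleaving, caching, and access control.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    base: int    # first global address of the region
    size: int    # region size in bytes
    device: str  # which memory backs the region


class GlobalAddressMap:
    """Packs devices contiguously into one global address space."""

    def __init__(self, devices: List[Tuple[str, int]]):
        self.regions: List[Region] = []
        base = 0
        for name, size in devices:
            self.regions.append(Region(base, size, name))
            base += size
        self.total = base
        self._bases = [r.base for r in self.regions]

    def translate(self, addr: int) -> Tuple[str, int]:
        """Return (device, local offset) for a global address."""
        if not 0 <= addr < self.total:
            raise ValueError("address outside the global map")
        # Binary search for the last region whose base is <= addr.
        region = self.regions[bisect_right(self._bases, addr) - 1]
        return region.device, addr - region.base


GiB = 1 << 30
# Hypothetical mix of memories behind one fabric.
memory_map = GlobalAddressMap([("HBM", 16 * GiB), ("LPDDR", 32 * GiB), ("CXL-DRAM", 128 * GiB)])
print(memory_map.translate(20 * GiB))
```

With a map like this, software sees one pool of capacity, while the fabric decides which physical memory services each request.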
Memory pooling is another option, and one that is gaining traction in data centers. Memory pooling does for DRAM what hyperscaling does for processor cores. When additional memory is needed, it is made available the same way additional compute cores are made available, usually through a CXL interface.
“The idea behind pooling is that if I’ve got a set of servers, and they’ve each got memory in them, then it’s really unlikely that each of them is using all of their memory capacity at the same time,” said Rambus’ Woo, in a recent presentation at a CASPA event. “What makes more sense is to take some of that capacity and put it in an external chassis and treat it like a pooled resource. When the processors need more than they’ve got inside the chassis, they can check out and provision some of that memory for a short period of time, use it for their computations, and then return it back into the pool. That’s one of the new features that has a lot of people in the industry excited. A little further out, once you do these types of things, you can start thinking about attaching memories and pools through switches. The CXL standard also allows for multi-level switching, as well. That kind of flexibility will help to improve both performance and total cost of ownership across a very wide range of applications.”
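The check-out-and-return behavior Woo describes can be sketched as a small bookkeeping class. This is a simplified illustration under stated assumptions: capacity is tracked in whole gigabytes, requests are all-or-nothing, and the class and method names are hypothetical. It does not model the CXL protocol itself.

```python
from typing import Dict, Optional


class MemoryPool:
    """Tracks capacity that servers borrow from a shared external chassis."""

    def __init__(self, capacity_gb: int):
        self.capacity_gb = capacity_gb
        self.leases: Dict[str, int] = {}  # server id -> GB currently checked out

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - sum(self.leases.values())

    def check_out(self, server: str, gb: int) -> bool:
        """Provision gb of pooled memory to a server if enough is free."""
        if gb <= 0 or gb > self.free_gb:
            return False
        self.leases[server] = self.leases.get(server, 0) + gb
        return True

    def give_back(self, server: str, gb: Optional[int] = None) -> int:
        """Return some or all of a server's lease; reports the GB released."""
        held = self.leases.get(server, 0)
        released = held if gb is None else min(gb, held)
        remaining = held - released
        if remaining:
            self.leases[server] = remaining
        else:
            self.leases.pop(server, None)
        return released


pool = MemoryPool(512)
pool.check_out("server-a", 300)         # server-a borrows for a burst
print(pool.check_out("server-b", 300))  # not enough left until server-a returns it
pool.give_back("server-a")
print(pool.check_out("server-b", 300))
```

The economic argument rests on the second request: because servers rarely peak at the same time, a pool smaller than the sum of worst-case demands can still satisfy each server in turn.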
Other memory approaches
In addition to the more traditional approaches, DRAM is branching out in a variety of directions. In part, this is due to the shift to heterogeneous integration and advanced packaging, with more domain-specific designs, and in part because of the benefits of processing closer to the source of data.
“Comparing computation and DRAM, we are using 17% of the energy for computation and using 63% of the power moving the data from one point to the other,” said Jongsin Yun, memory technologist at Siemens Digital Industries Software. “This is a significant amount of energy. We can save that and improve speed and the power efficiency. The current solution is adding more memory into cache, but that’s an expensive solution. We don’t need to move all the data to DRAM. We can do some compute in memory, or use some GPU-based AI convolution so we can do the calculation without the memory transfer.”
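A quick back-of-the-envelope calculation shows why those percentages matter. The sketch below takes the two shares Yun cites (17% for computation, 63% for data movement) and assumes the unattributed 20% stays constant. It also assumes, optimistically, that avoiding a transfer costs no extra energy elsewhere. The function name and those assumptions are illustrative, not part of the quoted analysis.

```python
COMPUTE_SHARE = 0.17        # from the quote above
DATA_MOVEMENT_SHARE = 0.63  # from the quote above
OTHER_SHARE = 1.0 - COMPUTE_SHARE - DATA_MOVEMENT_SHARE  # unattributed remainder


def relative_energy(movement_avoided: float) -> float:
    """Total energy, relative to today, if a fraction of data movement is avoided.

    movement_avoided is between 0.0 (no change) and 1.0 (all transfers removed).
    """
    if not 0.0 <= movement_avoided <= 1.0:
        raise ValueError("movement_avoided must be between 0 and 1")
    return COMPUTE_SHARE + OTHER_SHARE + DATA_MOVEMENT_SHARE * (1.0 - movement_avoided)


for fraction in (0.25, 0.5, 0.75):
    print(f"avoid {fraction:.0%} of transfers -> {relative_energy(fraction):.3f}x energy")
```

Under these assumptions, eliminating half of the data movement would cut total energy by nearly a third, which is far more than any plausible improvement in the compute share alone could deliver.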
There are more options today than ever before, and there are many more in the development stage. Winbond, for example, developed a couple of memory solutions that are based on DRAM, but which go beyond the classic DRAM use model. One is the company’s single-die CUBE (customized ultra-bandwidth element) architecture. The other is pseudo-static DRAM, which fits somewhere between SRAM and DRAM, eliminating the need for external data rewrite. Both of these are aimed at specific markets such as wearables and edge servers.
“Right now the hottest topic is generative AI,” said CS Lin, marketing executive at Winbond Electronics Corp. America. “But what’s happening in the data center has different requirements than where we are focusing, and there is very different density. We are focused on density of 16 gigabytes/second and below, but the solution is scalable down to 256 kilobytes/second. It runs at pretty close to HBM2 bandwidth, but with the advantage of very low power.”
Fig. 3: CUBE approach, with ~25ns latency and 5X higher unit density than 14nm SRAM. Source: Winbond
The benefit of this approach, said Lin, is the ability to use standard DRAM to boost performance, rather than relying on the most advanced process nodes. Typically, higher density creates latency, but the CUBE architecture uses thousands of through-silicon vias to move data, with a flexible assignment of those vias based upon the need for either more bandwidth or higher speed. That allows a more granular system architecture, as well as a smaller footprint.
Another approach is equalization. This has been on the drawing board for some time, but it finally appears to be gaining traction. “Equalization improves the data you receive at the end of your channel,” Synopsys’ Allan explained. “In very simple terms, it’s like intersymbol interference. When a series of bits is being transferred across the channel, by the time one bit is done, it’s actually into the next bit’s time domain. Signals going up and down and switching from ones to zeroes take longer than one unit interval. You’re not starting from a steady state low potential. You’re starting from a higher state. You’re offsetting the sampling point in your input receiver using decision feedback equalization. So how can I now optimize my input receivers, such that I’m going to have similar margin for a one and a zero detection? I’m not really sensing something to put the reference voltage exactly in the middle.”
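The effect Allan describes can be illustrated with a toy model of decision feedback equalization (DFE). The sketch assumes a noiseless channel in which each received sample is the current symbol plus a fixed fraction of the previous one, and a one-tap equalizer that knows that fraction exactly. Real receivers have multiple taps, noise, and adaptive coefficients, and the function names and the 0.6 coefficient are hypothetical. Subtracting the estimated interference before slicing at zero is equivalent to shifting the decision threshold based on the previous bit.

```python
from typing import List, Tuple


def channel(bits: List[int], post_cursor: float = 0.6) -> List[float]:
    """Map bits to +1/-1 levels and add interference from the previous symbol."""
    samples, previous_level = [], 0.0
    for bit in bits:
        level = 1.0 if bit else -1.0
        samples.append(level + post_cursor * previous_level)
        previous_level = level
    return samples


def dfe_receive(samples: List[float], tap: float = 0.6) -> Tuple[List[int], List[float]]:
    """One-tap DFE: cancel the previous decision's interference, then slice at 0.

    Returns the decided bits and the margin (distance from the threshold) of each.
    """
    decisions, margins, previous_level = [], [], 0.0
    for sample in samples:
        corrected = sample - tap * previous_level
        bit = 1 if corrected > 0 else 0
        decisions.append(bit)
        margins.append(abs(corrected))
        previous_level = 1.0 if bit else -1.0
    return decisions, margins


sent = [1, 0, 0, 1, 1, 0, 1, 0]
received = channel(sent)
raw_margin = min(abs(s) for s in received)  # worst-case margin without equalization
decided, margins = dfe_receive(received)
print(raw_margin, min(margins), decided == sent)
```

In this idealized case, the equalizer restores the same margin for every bit, whether it is a one or a zero and regardless of what preceded it. That is the "similar margin for a one and a zero detection" Allan refers to.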
Also on the horizon is in-memory compute. While there have been a couple of commercial approaches using MRAM, researchers at Princeton University demonstrated in a 2019 paper an external DRAM controller in an FPGA that can be used with off-the-shelf DRAM to perform massively parallel computation. The researchers claimed this approach overcomes the so-called memory wall, in which logic performance has outpaced memory bandwidth.
Tradeoffs
So how much SRAM gets used versus DRAM? There is no simple formula for this, because it isn’t an apples-to-apples comparison.
“There really is no magic way to do this,” said Cheng Wang, CTO and co-founder of Flex Logix. “Most of our design tradeoffs come from a performance estimation that models SRAM bandwidth, SRAM capacity, and DRAM bandwidth. Those are our three primary knobs. And basically we have four standard sizes of compute, with different amounts of SRAM and DRAM bandwidth for our standard offering for IP. That’s based on our empirical data of running models to determine what works better. Some models can run better if we have 2X the amount of SRAM. If you can almost double your performance by doubling the SRAM, and you put another 20% of the area for 2X performance, that’s great. But there are a lot of other models that wouldn’t be able to benefit from the additional SRAM, and then you’re adding that area for nothing. That’s why it’s important to have cycle-accurate performance estimates. It’s not accurate down to a single cycle in our case, but it is accurate to 8%, which is more than what we need. And then you can do a lot of architectural analysis of proper SRAM/DRAM compute tradeoffs, which may differ by the type of workloads.”
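A much cruder version of that kind of estimate can be written in a few lines. The sketch below is not Flex Logix’s model. It is a hypothetical first-order approximation built on the same three knobs (SRAM bandwidth, SRAM capacity, and DRAM bandwidth), and it assumes the fraction of traffic served from SRAM is proportional to how much of the working set fits there. All workload and hardware numbers are placeholders.

```python
def estimate_time(ops: float, working_set_mb: float, traffic_mb: float, *,
                  compute_ops_per_s: float, sram_mb: float,
                  sram_bw_mb_s: float, dram_bw_mb_s: float) -> float:
    """First-order runtime: the slower of compute time and memory time."""
    # Assumption: SRAM hit fraction scales with how much of the working set fits.
    hit = min(1.0, sram_mb / working_set_mb)
    sram_time = traffic_mb * hit / sram_bw_mb_s
    dram_time = traffic_mb * (1.0 - hit) / dram_bw_mb_s
    compute_time = ops / compute_ops_per_s
    return max(compute_time, sram_time + dram_time)


def speedup_from_doubling_sram(ops: float, working_set_mb: float,
                               traffic_mb: float, **hw: float) -> float:
    """Ratio of runtime before and after doubling SRAM capacity."""
    before = estimate_time(ops, working_set_mb, traffic_mb, **hw)
    after = estimate_time(ops, working_set_mb, traffic_mb,
                          **{**hw, "sram_mb": 2 * hw["sram_mb"]})
    return before / after


# Placeholder hardware: 1 Tops/s, 4 MB SRAM at 1 TB/s, DRAM at 50 GB/s.
HW = dict(compute_ops_per_s=1e12, sram_mb=4.0, sram_bw_mb_s=1e6, dram_bw_mb_s=5e4)

speedup_a = speedup_from_doubling_sram(1e9, 8.0, 200.0, **HW)  # working set spills to DRAM
speedup_b = speedup_from_doubling_sram(1e9, 2.0, 200.0, **HW)  # already fits in SRAM
print(speedup_a, speedup_b)
```

Even this toy model reproduces the pattern Wang describes. The first workload roughly doubles in performance when SRAM doubles, while the second gains nothing and simply pays for the extra area.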
This is complex math, and it’s becoming more complex as systems are disaggregated into heterogeneous elements, such as chiplets. “SRAM requires more transistors per bit to implement. It is less dense and more expensive than DRAM, and has higher power consumption during read and write,” said Takeo Tomine, principal product manager at Ansys. “Currently, SRAM is designed on advanced finFET technology nodes that a CPU typically is designed with, and a finFET device is more prone to thermal effects (self-heat) due to higher thermal resistance of the device.”
In some cases, what type of memory to use, and where to use it, may come down to the expected lifespan of a device. “There are two major reliability concerns that lead to lifespan reduction of memories,” Tomine said. “One is that interconnecting reliability with technology node shrinking leads to lifespan of memories due to self-heat causing severe electromigration (EM), which is one of the most critical reliability issues. EM lifetime improvements by material and process technologies continues, along with technology scaling. The second is reliability challenges from different architectures of devices. In moving device architectures from finFETs to nanosheets to CFETs, thermal resistivity increases drastically, which translates to a higher delta T device channel. Device self-heat will couple with metal joule heating. Self-heating of a device will impact gate oxide breakdown (time-dependent dielectric breakdown), and also will degrade HCI (hot carrier injection), which will worsen the BTI (bias temperature instability) of the device.”
Reliability is a measure of the ability of a memory device to perform without failure for a given amount of time. That timeframe can be very different for a smartphone, which is expected to last four years, versus automotive, military, or financial server applications, where the life expectancy can be 10 to 15 years (or more). Being able to understand the potential interactions that can affect the lifespan of memories is critical, and they can vary by architecture and by memory type and usage.
That also affects what kind of memory is used, and the overall system architecture. So if memories can be swapped out, lifespans are less relevant than if those memories are embedded into some type of advanced package and sealed up. “It’s like having a pool of DRAM cards, which can be upgraded today,” said Renesas’ Kanigicherla. “With HBM, you can’t do anything if something goes wrong, so you’re throwing away a very expensive chip. On the CPU side, the servers are very closely attached, and there is not much you can do to upgrade anything. That’s why this global shared memory concept works. Some of these solutions come in automatically.”
Latency adds another tradeoff. “Especially with HBM, you’re putting the processor and the DRAM very close together,” said Frank Ferro, group director for product marketing in Cadence’s IP Group. “There are a lot of advantages to doing that. HBM has been advancing pretty rapidly. We see improvements almost every two years in performance. So that curve is steep. But from a system design standpoint, 2.5D is still a challenge. Optimizing the interposer and helping customers design that is really a key part of the conversation.”
Conclusion
Since its invention in 1967, DRAM has been a linchpin for computing. While numerous memory technologies have challenged it over the years, nothing has displaced it. And given the frenzy of activity surrounding this technology, nothing will displace it in the foreseeable future.
Rather than one type of DRAM, there now are many types, and each of them is evolving and spawning new ideas. There is innovation on every level, from the physical connection of the memory to the processing elements, to the pooling of memory outside a rack of servers. And there is work underway to shorten the distance signals need to travel between memory and processor cores, which would reduce the amount of power needed to move that data and the time it takes per cycle.
Put in perspective, DRAM remains a dynamic and innovative field. More innovations are on the horizon, along with different ways to put together a memory solution, and those choices can have a big impact on performance, cost, reliability, and longevity.
Reference
1. Direct comparisons between SRAM and DRAM costs are not always clear-cut, according to Objective Analysis’ Handy, because SRAM is sold by the chip and DRAM by the byte. On the spot market, which is a clearinghouse for excess inventory, untested and unmarked DRAM sells for as little as $1 per gigabyte, while the same amount of SRAM would cost roughly $5,000.
Further Reading
DRAM Test And Inspection Just Gets Tougher
Increased size, faster interfaces, and 2.5D/3D packages puts squeeze on inspection and test methods.
HBM’s Future: Necessary But Expensive
Upcoming versions of high-bandwidth memory are thermally challenging, but help may be on the way.