Memory footprint, speed and density scaling are compounded by low-power constraints.
Chipmakers are paying much closer attention to various DRAM options as they grapple with what goes on-chip or into a package, elevating attached memory to a critical design element that can affect system performance, power, and cost.
These are increasingly important issues to sort through, each with a number of tradeoffs, but the general consensus is that reaching the higher levels of performance required to process more data per watt demands better power efficiency. Increasingly, the flavor of DRAM used, and how it is accessed, play a huge role in hitting that performance target within a given power budget.
“The low power space is about how to maintain the signal integrity and get to the performance levels needed, all the while improving the power efficiency,” said Steven Woo, fellow and distinguished inventor at Rambus. “In the mobile space, there’s no end in sight to the performance needed on the next generation. There are a lot of concepts that previously were used in just one type of memory, such as multiple channels, which now are being implemented across different markets.”
The main questions center on which concepts get borrowed for the other types of memories, and how the low power market will start influencing what goes on in main memory. Low-power engineering teams have been focusing primarily on trying to save power and stay within a certain power envelope. “Given that DRAMs are all based on the same basic cell technology, it’s about how to optimize everything else that’s around it, how to optimize for a low power environment, and how to optimize for a high performance environment,” Woo said.
Rami Sethi, vice president and general manager at Renesas Electronics, agreed. “There’s a general acknowledgement that the DDR bus, in terms of its ability to support multi-slot configuration, is going to become increasingly challenged over time. As you run that wide parallel pseudo-differential, mostly single-ended-style kind of bus into 6, 7, and 8 gigatransfer-per-second speeds, first you’re going to lose that second slot. You’re not going to be able to support a multi-slot configuration at those speeds at some point. When that happens, you effectively cut your memory capacity in half. DRAM scaling and density increases will make up for some of that, and you can continue to add more and more channels to the memory controller. But ultimately that’s going to run out of steam, and that approach just won’t get you the incremental capacity that you need.”
Stuart Clubb, technical product management director for digital design implementation at Siemens EDA, points to a similar trend. “Some years ago, NVIDIA published an energy cost comparison showing that going out to main memory (specifically DRAM) versus local CPU registers was about 200X more energy cost for the same computational effort. Other papers in the past have detailed the near 20X difference in power costs between Level 1 cache fetches and main memory. It stands to reason that anything you can do to reduce iterative main memory access is going to be of benefit.”
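Those ratios are easiest to appreciate with a little arithmetic. The sketch below is a back-of-the-envelope model, not data from NVIDIA or Siemens; the per-access energies are placeholder values chosen only to reflect the rough 20X and 200X ratios cited above, and the two-level hierarchy is a deliberate simplification.

```python
# Hypothetical per-access energies (pJ), chosen only to match the quoted
# ratios: DRAM ~20X an L1 hit and ~200X a register access. Real values vary
# widely with process node, DRAM type, and physical distance.
ENERGY_PJ = {
    "register": 0.1,
    "l1_cache": 1.0,
    "dram": 20.0,
}

def average_access_energy(l1_hit_rate: float) -> float:
    """Average energy per load (pJ) for a simple two-level hierarchy: L1, then DRAM."""
    return l1_hit_rate * ENERGY_PJ["l1_cache"] + (1.0 - l1_hit_rate) * ENERGY_PJ["dram"]

for hit_rate in (0.90, 0.99, 0.999):
    print(f"L1 hit rate {hit_rate:.1%}: {average_access_energy(hit_rate):.2f} pJ per access")
```

Even with these made-up numbers, shaving a fraction of a percent off the miss rate moves the average access energy far more than any plausible tweak to the register file, which is the point Clubb makes about reducing iterative main-memory access.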
Further, as power and energy become even more important product metrics, the use of application-specific accelerators to supplement general-purpose compute resources is increasing.
“Be it processing-in-memory, computational storage, bus-based accelerators on the SoC or the PCI server side, or in-line pre-processing, the need for efficient hardware that reduces main memory-associated energy costs is growing,” Clubb said. “Specialized accelerators need to be built for specific tasks for low energy consumption. While traditional RTL power estimation and optimization tools can help in the general RTL design effort, this accelerator space is where we have seen an uptick in the use of high-level synthesis. Exploring the design space, experimenting with varying architectures, and ultimately building those accelerators with competitive low-power RTL is where HLS is adding value. Custom accelerators with localized lower power memory solutions have the advantage that when not in use, you can turn them off completely, which you won’t be doing with a CPU/GPU/NPU-type solution. No matter how much you try to optimize your main memory architecture, the energy cost of data movement is probably something you really want to avoid as much as possible.”
Muddy water
Inside data centers, this requires some tradeoffs. There are costs associated with moving data, but there also are costs associated with powering and cooling servers. Allowing multiple servers to access memory as needed can improve overall memory utilization.
This is one of the main drivers for Compute Express Link, a CPU-to-memory interconnect specification. “CXL basically allows you a serial connection,” said Renesas’ Sethi. “You’re reducing the number of pins that you need. You can put modules that are based on CXL further away from the CPU, and there’s better extensibility than you get with a direct DDR attach. Those modules can look more like storage devices, SSDs, or PCIe add-in cards, so you can get more density in the form factor. Then, CXL also gives a lot of the protocol hooks for things like cache coherency, load store, memory access, so it starts to allow DRAM to look more like a direct attached DRAM, or at least like a non-uniform memory access (NUMA)-style DRAM access.”
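The utilization argument is easy to see with rough numbers. The sketch below is a minimal illustration, assuming hypothetical per-server working sets and assuming that demand peaks rarely coincide; it is not drawn from Renesas or the CXL specification.

```python
# A minimal sketch of the memory-pooling argument. All capacities are
# hypothetical placeholders, not measured data from any deployment.
peak_demand_gb    = [512, 384, 768, 256]   # per-server peak working sets (assumed)
typical_demand_gb = [180, 120, 400,  90]   # typical in-use memory (assumed)

# Direct-attached DRAM: each server must be provisioned for its own peak.
direct_capacity = sum(peak_demand_gb)

# Pooled (e.g., CXL-attached): a modest local allocation per server, plus a
# shared pool sized for the largest single overflow, assuming peaks rarely
# happen on all servers at once.
local_gb = 128
pool_gb = max(peak - local_gb for peak in peak_demand_gb)
pooled_capacity = local_gb * len(peak_demand_gb) + pool_gb

in_use = sum(typical_demand_gb)
print(f"Direct-attached: {direct_capacity} GB installed, {in_use / direct_capacity:.0%} typically in use")
print(f"Pooled:          {pooled_capacity} GB installed, {in_use / pooled_capacity:.0%} typically in use")
```

With these assumptions the pooled configuration installs roughly 40% less DRAM for the same typical demand, which is the kind of utilization gain that motivates moving memory behind a serial, load/store-capable link.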
That, in turn, requires more consideration about how to architect memory, and which type of memory to use. “When people are choosing between memories, such as DDR and LPDDR, realistically there are some things that are going to be DDR for a really long time. If you’re building a big server, you’re probably going to build a memory out of DDR. If you’re building a mobile phone, you’re probably going to build a memory out of LPDDR. There are very clear cut things on both sides,” noted Marc Greenberg, group director, product marketing, DDR, HBM, Flash/Storage, and MIPI IP at Cadence. “However, the middle is less clear. Over the past five years it has become increasingly muddy, with LPDDR, for example, being used in places that traditionally may have been exclusively the purview of the DDR memories.”
Each has its strengths and weaknesses, but there is enough overlap to confuse things. “One of the strengths of DDR memory is the ability to add capacity in a removable way, such that if you want to add more gigabytes of storage or have many gigabytes of storage, DDR is the way to do it in most cases. What the LPDDR memories offer is a certain range of memory densities, of capacity that match well to the mobile devices they go into. But those capacities in some cases match certain types of computing functions, as well. One of the areas where we’ve seen LPDDR memory start to make its way into the server room is in various kinds of machine learning/artificial intelligence accelerators.”
There are a number of attributes that every memory type has, from how much capacity can be put on the interface to how easy it is to add capacity and how much bandwidth it supports. There also are power and enterprise reliability standards to consider. Greenberg noted that in some instances, the LPDDRs match the requirements for certain types of systems better than the DDR memories, even if it’s a traditionally DDR type of application.
Randy White, memory solutions program manager at Keysight Technologies, believes there are more or less two choices — DDR or LPDDR, with HBM and GDDR being used in more specialized designs. “Unless you are specializing in a niche application, the choice really comes down to two. DDR is probably 60% to 70% of the volume of memory out there between data centers and desktops. LPDDR is 30% or more, and that’s growing because it tracks the number of new products that are introduced from phones to other mobile devices. At the same time, LPDDR tends to be ahead of mainstream DDR by one year or more. That’s been the same for years and years, so why is LPDDR always pushing the envelope? They always get the specs released earlier, as well. It’s because the phone is the target device, and that’s where there’s so much money tied up. You don’t want the memory to be the bottleneck.”
Choosing the right memory
White says this decision comes down to two things — capacity and application. “Do you need more than 64 gigabytes? A phone or a mobile application would use anywhere from 16 to 32 gigabytes for system memory. This is different than storage for all of the videos and files. You pay a lot of money for that, and your phone provider sells you options for that. But the system memory is fixed. You don’t know about it, it just works. For phones, you don’t really need that much. A server that’s running thousands of virtual machines, doing financial transactions, engineering, database queries, or Netflix streaming, needs terabytes of memory, an order of magnitude or more. That’s the number one selection criterion. How much do you want?”
The second consideration is where it’s going. “What’s the form factor that you’re going into? Servers need a lot of memory, so they have many DIMM slots. But how are you going to get 64 gigabytes of memory, across multiple chips, into a phone? The phone fits in the palm of your hand. Don’t forget about the display, the battery, the processor, so the space constraints are different,” White said.
An additional consideration involves the evolution of mobile design. “You need more compute power,” he said. “You need more memory, but space is shrinking. How do you deal with that? This is a really fascinating trend. There’s much more integration between the processor and the memory itself. If you look at an old phone from 5 or 10 years ago, the processor was on one part of the board, then the signal was routed across the board — maybe an inch or two — to the discrete memory component, and the signals went back and forth. Now we see the trend of die stacking and package-on-package. Now you can stack up to 16 dies, and that’s how you get 32 gigabytes or more, because these memory chips are no more than 2 or 4 gigabytes each. The integration is getting so high, for space but also for speed, that you get better signal integrity if you’re not transmitting so far down the circuit board.”
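The capacity math behind that comment is straightforward. The sketch below just multiplies an assumed per-die density, in line with the 2 to 4 gigabyte figure White mentions, by the stack height.

```python
# Stack-capacity arithmetic. The per-die density is an assumption consistent
# with the 2-4 GB figure quoted above, not a specific vendor's part.
per_die_gb = 2        # assumed capacity of one DRAM die
dies_per_stack = 16   # stack height mentioned in the quote

print(f"{dies_per_stack} dies x {per_die_gb} GB = {dies_per_stack * per_die_gb} GB per package")
```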
At the same time, this doesn’t mean system architects are finding it easier to make the tradeoffs. In fact, it is not uncommon for engineering teams to change their minds multiple times between DDR, LPDDR, GDDR6, and even HBM, Cadence’s Greenberg said.
“People will go back and forth between those decisions, try it out, weigh up all their options, see how it looks, and then sometimes change types after they’ve evaluated for a while,” he said. “They’re typically doing system-level modeling. They’ll have a model of their view and their neural network, along with a model of the memory interface and the memory itself. They’ll run traffic across it, see how it looks, and get a performance assessment. Then they look at how much is each one going to cost because, for example, the HBM memory stands out as having extremely high bandwidth at very reasonable energy per bit. But there are also a lot of costs associated with using an HBM memory, so an engineering team may start out with HBM, run simulations, and when all their simulations look good they’ll budget and realize how much they would be paying for a chip that has HBM attached to it. And then they’ll start looking at other technologies. HBM does offer excellent performance for a price. Do you want to pay that price or not? There are some applications that need HBM, and those devices will end up at a price point where they can justify that memory use. But there are a lot of other devices that don’t need quite as much performance as that, and they can come down into GDDR6, LPDDR5, and DDR5 in some cases.”
Fig. 1: Simulation of GDDR6 16G data eye with channel effects. Source: Cadence
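That kind of evaluation is normally done with detailed traffic models and signal-integrity simulations like the one in Fig. 1. As a much cruder illustration of the first-pass screen Greenberg describes, the sketch below compares memory options against a bandwidth target; every bandwidth, energy, and cost figure is a placeholder assumption, not vendor data, and the option names are generic.

```python
from dataclasses import dataclass

@dataclass
class MemoryOption:
    name: str
    bandwidth_gbps: float      # GB/s per device or stack (assumed)
    energy_pj_per_bit: float   # interface energy, pJ/bit (assumed)
    relative_cost: float       # arbitrary cost units per device (assumed)

# Placeholder figures for illustration only.
candidates = [
    MemoryOption("HBM-class stack", 800.0, 4.0, 40.0),
    MemoryOption("GDDR6 device",     64.0, 7.0,  3.0),
    MemoryOption("LPDDR5 channel",   50.0, 5.0,  2.0),
    MemoryOption("DDR5 DIMM",        40.0, 8.0,  1.5),
]

required_gbps = 400.0  # hypothetical accelerator bandwidth requirement

for m in candidates:
    devices = -(-required_gbps // m.bandwidth_gbps)          # ceiling division
    interface_power_w = required_gbps * 8 * m.energy_pj_per_bit / 1000
    print(f"{m.name:16s}: {int(devices)} device(s), ~{interface_power_w:.1f} W interface power, "
          f"cost score {devices * m.relative_cost:.1f}")
```

Even a toy screen like this surfaces the pattern Greenberg describes: the HBM-style option wins on device count and energy per bit but carries the highest cost score, which is why teams often re-run the exercise with GDDR6, LPDDR5, or DDR5 once the budget becomes real.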
Additionally, when the focus is on low power, it is assumed that LPDDR can’t be beat. That isn’t correct.
“The real low power memory is HBM,” said Graham Allan, senior manager for product marketing at Synopsys. “HBM is the ultimate point-to-point because it’s in the same package, on some form of interposer. There’s a very short route between the physical interface on the SoC and the physical interface on the DRAM. It is always point-to-point, short-route, and unterminated, so I’m not burning any termination power. If you look at the power efficiency, which is the energy that it takes to transfer one bit of information — often expressed in picojoules per bit or gigabytes per watt — the power efficiency for HBM is the best of any DRAM. So HBM is really the ultimate low power DRAM.”
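Those two units are just reciprocals of each other, up to a constant. As a quick bridge between them, the snippet below converts assumed pJ/bit figures into bandwidth per watt; the example values are illustrative, not the specs of any particular DRAM.

```python
def gb_per_s_per_watt(pj_per_bit: float) -> float:
    # 1 GB/s = 8e9 bits/s and 1 pJ = 1e-12 J, so power in watts is
    # (GB/s) * 8 * (pJ/bit) / 1000. Invert that to get GB/s per watt.
    return 1000.0 / (8.0 * pj_per_bit)

for pj in (4.0, 7.0, 15.0):  # assumed example efficiencies
    print(f"{pj:4.1f} pJ/bit  ->  {gb_per_s_per_watt(pj):5.1f} GB/s per watt")
```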
Another avenue for power reduction, which memories historically have not taken advantage of, is to separate the power supply for the core of the DRAM from the I/O. “DRAM always likes to have one power supply,” Allan said. “Everything in the whole DRAM chip is running on the same power supply, and LPDDR4 operated off a 1.2 volt supply. The data eyes were only 350 to 400 millivolts tall on that supply. Somebody very smart said, ‘Why are we doing that? Why don’t we have a lower voltage supply to get the same height of the data eye? Sure, there’s going to be a little bit of compromise on the rise and fall times, because we’re not driving these transistors with the same drive strength. But it’s point-to-point, so it should be okay.’ And that’s what became LPDDR4x. The major difference between LPDDR4 and LPDDR4x was taking the power supply for the DRAM and chopping it in two — one power supply for the I/O, one power supply for the DRAM core.”
Understandably, everyone who could go to LPDDR4x would have, because the DRAM vendors basically have one die that can support either operating voltage.
“For the host, unfortunately, if you were designed for LPDDR4, you’re going to be driving LPDDR4 signals out to the DRAM,” Allan said. “And if you’re an LPDDR4x DRAM, you’re going to say, ‘Those signals are a little bit too strong for me, and the voltage is too high. I can’t guarantee my long-term reliability isn’t impacted by what you’re giving me. So technically, you’re violating the overshoot/undershoot specs of the DRAM.’ You had to go through this process where there was a transition. Our customers were asking for help moving from LPDDR4 to LPDDR4x. At the end of the day, it’s not a huge power savings. It’s maybe in the range of 15% for the overall subsystem power savings. And that’s because the core of the DRAM takes a lot of power. You’re not changing that power supply, so you’re not changing how those circuits work. You’re only changing the voltage for the I/O to transfer data across the bus. You’re doing that in the PHY on the SoC and you’re doing that in the PHY on the DRAM. Now, interestingly enough, as we’ve gone from HBM2 and HBM2e to HBM3, we’ve gone from a common 1.2 volt supply for HBM2e to a 0.4 volt operating supply for the I/Os on HBM3. So we’ve reduced it to a third. That’s a big power savings, especially when there are 1,024 of these signals going up and down.”
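The reason a lower I/O rail matters so much follows from the standard first-order switching-power relation, P ≈ α·C·V²·f. The sketch below applies it to the supply drop Allan describes; the load capacitance, activity factor, and per-pin data rates are assumptions for illustration, and the model ignores termination, driver, and DRAM-core power entirely.

```python
# First-order switching power: P ~= activity * C_load * V^2 * f, summed over
# all data lanes. All numeric inputs are illustrative assumptions.

def io_switching_power_w(v_dd: float, data_rate_gbps: float, lanes: int,
                         c_load_pf: float = 1.0, activity: float = 0.5) -> float:
    """Aggregate I/O switching power in watts across all data lanes."""
    f_hz = data_rate_gbps * 1e9     # toggle opportunities per lane per second
    c_f = c_load_pf * 1e-12         # assumed load capacitance per lane
    return activity * c_f * v_dd ** 2 * f_hz * lanes

# 1,024 data lanes, as in the quote; supply voltages from the quote; per-pin
# data rates are assumptions.
p_hbm2e = io_switching_power_w(v_dd=1.2, data_rate_gbps=3.2, lanes=1024)
p_hbm3 = io_switching_power_w(v_dd=0.4, data_rate_gbps=6.4, lanes=1024)
print(f"1.2 V I/O at 3.2 Gb/s/pin: ~{p_hbm2e:.2f} W")
print(f"0.4 V I/O at 6.4 Gb/s/pin: ~{p_hbm3:.2f} W")
print(f"V^2 alone gives a {(1.2 / 0.4) ** 2:.0f}x reduction at equal frequency")
```

Even with the per-pin data rate doubling in this toy model, the nine-fold V² term dominates, which is why dropping the I/O supply to a third pays off so well on a 1,024-bit-wide interface.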
Reliability concerns
Set against all of the considerations above is an increasing concern for system reliability, according to Rambus’ Woo. “Reliability is becoming a more prominent first-class design parameter. In smaller process geometries, things get more complicated, things interfere, and device reliability is a little bit harder. We’re seeing things like the refresh interval dropping because the capacitive storage cells are getting smaller. Those are all reflections of how reliability is becoming more important. The question is, with integrity being more challenging, how does the architecture change for these DRAM devices? Is more on-die ECC, or something like that, being used? That’s all happening because it’s now a more prominent problem.”
So what comes next, and how does the industry move forward? “When there are issues at a component level that are really challenging to solve, either because technologically it’s hard or because it’s really expensive, we tend to see those things pulled up to the system level,” said Woo, “and people trying to find ways at the system level to solve them.”