Pushing Memory Harder

Can the processor/memory bottleneck be closed, or do applications need to be re-architected to avoid it?


In an optimized system, no component is waiting for another component while there is useful work to be done. Unfortunately, this is not the case with the processor/memory interface.

Put simply, memory cannot keep up. Accessing memory is slow, and it can consume a significant fraction of the power budget. The general consensus is that this problem is not going away anytime soon, despite efforts to make memory faster and less power-hungry.

“There are charts that show how bad the memory bottleneck is getting,” says Steven Woo, fellow and distinguished inventor at Rambus. “If you can avoid going to memory you should. That is just good standard practice. If you can avoid going to the network, you should. If you can avoid going to disk, you should. But when you look at applications like AI and the growth of the network sizes that people are wanting to implement, they are growing faster than other technology curves can keep up. The only recourse is to use DRAM. Of course, people want to avoid using memory, but I don’t see how that is practical for the tougher, more demanding networks.”

The problem is that as soon as you are forced to stop using SRAM, which is integrated on-chip, performance and power quickly become problematic. Studies of the energy consumed by a simple mathematical operation, such as a multiply or add, show that a large fraction of the power is spent setting up the computation.

“If you go off-chip to DRAM to get the data, a lot of energy is spent doing that and moving the data around, then across the chip,” says Woo. “That can be 95% of the power. What this tells you is that it is an energy problem that we have, and this is why everyone wants to bring computation and data closer together.”

Memory and standards bodies are not standing still. “Now is an interesting time for memory,” says Vadhiraj Sankaranarayanan, technical marketing manager for Synopsys. “We have LPDDR5 that was released by JEDEC earlier this year, and the new DDR5 will be released shortly. These memories are taking the speeds to a higher level than their predecessors. LPDDR5 and DDR5 will both have a max speed of 6400Mbps, and that is a considerable speed increase. The memory standards are evolving both to increase the performance and, architecturally, to address reliability, availability and serviceability (RAS), which is directly tied to the robustness of the channel. Some of the new low-power features also will bring system power down.”
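To put the 6400Mbps figure in context, here is a rough sketch of how a per-pin data rate translates into theoretical peak bandwidth. The bus widths used (a 64-bit DDR5 data bus and a 16-bit LPDDR5 channel) are illustrative assumptions, not figures from the article, and real systems deliver less than this peak because of refresh, bus turnaround and other overheads.

```python
def peak_bandwidth_gbytes(rate_mbps_per_pin: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second).

    Ignores refresh, bus turnaround and every other real-world overhead.
    """
    bits_per_second = rate_mbps_per_pin * 1e6 * bus_width_bits
    return bits_per_second / 8 / 1e9


# Assumed widths, for illustration only.
ddr5_64bit = peak_bandwidth_gbytes(6400, 64)    # 51.2 GB/s
lpddr5_16bit = peak_bandwidth_gbytes(6400, 16)  # 12.8 GB/s

print(f"64-bit bus at 6400Mbps/pin: {ddr5_64bit:.1f} GB/s")
print(f"16-bit channel at 6400Mbps/pin: {lpddr5_16bit:.1f} GB/s")
```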

Moving memory closer to processing
Shorter wires have lower capacitance, which helps both performance and power. “The goal is to minimize the overall latency,” says Karthik Srinivasan, senior product manager for ANSYS. “That makes the memory appear as close to the compute as possible. For that, the best we have so far is high-bandwidth memory (HBM), which actually brings the memory into the package rather than having them 5 or 6 cm apart. With HBM, memory is a couple of millimeters away from the compute. The next best thing is to have on-chip memory, which offers higher bandwidth and very minimal latency.”

But you may have to work hard to make the application fit into on-chip memory. “To overcome the memory bottleneck, you have to make partitioning decisions,” says Farzad Zarrinfar, managing director of IP at Mentor, a Siemens Business. “You have to make decisions about what you want to do in an embedded fashion and what will use external memory. We believe that if you can minimize data movement or reduce the length of data movement, you improve your performance and minimize power.”

There are always tradeoffs. “HBM is the clear winner for devices that need very high bandwidth with the lowest energy per bit metric, and also be able to do it in a constrained amount of area of the PCB,” says Marc Greenberg, group director for product marketing in the IP Group of Cadence. “There are other metrics where HBM does not stack up so well. You can expect all of the DDR technologies to be lower in cost than HBM.”

Synopsys’ Sankaranarayanan adds a few more advantages that come with HBM. “For each HBM DRAM you save a lot of area because you do not need as many GDDR PHYs or DRAMs to get the same bandwidth. HBM also has a very good power efficiency compared to GDDR. So, HBM provides a lot of bandwidth, better power efficiency, area efficiency but the important thing is that it requires an interposer to integrate the SoC and the HBM. That makes it more costly.”

The incorporation of an interposer is not a decision made lightly, and the industry is still in the learning phase for several aspects of it. “How stable is the overall structure?” asks ANSYS’ Srinivasan. “What is the overall reliability? When you have thermal cycling, how will that impact warpage fatigue? Organic substrates are much thicker and more stable, but we are looking at much thinner silicon, especially when stacking multiple dies. As you put more compute and more memory into a smaller form factor, the power density is higher, which impacts thermal which in turn impacts the fatigue, warpage etc. There is a stated need in the industry to look at the structural aspects of these multi-die systems.”

Architecting the right memory
Von Neumann established the architecture that we use for computing systems. It was simple, scalable and flexible. But today, every decision has to be re-examined, and the von Neumann architecture may not offer the best solution for every problem.

This is particularly true as more power states and use cases are designed into devices. The memory mix can include non-volatile memories such as flash, MRAM, and phase-change memory, as well as volatile memories such as DRAM and SRAM.

“We had a customer recently working on a Bluetooth Low Energy application where they were constantly accessing the device in read mode,” said Paul Hill, director of marketing at Adesto Technologies. “They’re fetching the software from the device, flowing that into the cache and executing the code. For that reason, they have a power consumption issue in read mode. But occasionally, when the BLE device goes dormant, they want to turn off the chip’s memory device and go into an ultra-low-power mode. The problem with that is the ultra-low-power mode has a longer wake-up time, so when the BLE device goes active again there is a longer latency before they can get the next read instruction. We have to take that into account. We have different power modes that our memory device can operate in. There is standby mode, low-power mode and ultra-low-power mode. The customer can then determine what mode is appropriate.”
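The tradeoff Hill describes, lower idle power in exchange for a longer wake-up time, can be sketched as a simple selection rule: pick the lowest-power mode whose wake-up latency the application can tolerate. The three mode names come from the quote above, but every current and wake-up value below is a hypothetical placeholder, not a specification for any Adesto device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    name: str
    idle_current_ua: float  # hypothetical idle current, microamps
    wakeup_us: float        # hypothetical wake-up latency, microseconds


# Placeholder numbers chosen only to show the shape of the tradeoff.
MODES = [
    PowerMode("standby", 50.0, 0.0),
    PowerMode("low-power", 5.0, 10.0),
    PowerMode("ultra-low-power", 0.1, 100.0),
]


def pick_mode(max_wakeup_us: float, modes=MODES) -> PowerMode:
    """Return the lowest-current mode whose wake-up time fits the budget."""
    candidates = [m for m in modes if m.wakeup_us <= max_wakeup_us]
    if not candidates:
        raise ValueError("no power mode meets the wake-up budget")
    return min(candidates, key=lambda m: m.idle_current_ua)


print(pick_mode(20.0).name)   # tight latency budget
print(pick_mode(500.0).name)  # relaxed latency budget
```

With these placeholder values, a 20-microsecond budget selects low-power mode, while a 500-microsecond budget allows ultra-low-power mode.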

That also has an impact on where exactly the data is being stored, which has been a long-running problem in the memory world.

“Data locality is an issue that the industry has struggled with for many years and it is not completely solved,” says Cadence’s Greenberg. “When people are doing dedicated hardware, they can control the locality of data a little better by matching the memory management to the application. They may be able to organize data to all be in the same page of memory, which would reduce the amount of power used. Also, you want to organize data in a way that, if you request a burst of data from the memory, you actually are able to use all of the data requested. Some algorithms do that very well. Some things like cache line fills and evictions do that very well. Sometimes video data is poor at that.”
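Greenberg's point about using all of the data in a requested burst can be made concrete with a small sketch that compares the bytes an access pattern actually needs with the bytes the memory has to deliver. The 64-byte burst size, the 4-byte accesses and the 256-byte stride are assumptions chosen for illustration. The helper also assumes the accesses do not overlap one another.

```python
def burst_utilization(addresses, access_bytes: int, burst_bytes: int = 64) -> float:
    """Fraction of fetched bytes that the access pattern actually uses.

    Assumes accesses do not overlap and each touched burst is fetched once.
    """
    bursts = set()
    for addr in addresses:
        first = addr // burst_bytes
        last = (addr + access_bytes - 1) // burst_bytes
        bursts.update(range(first, last + 1))
    useful = len(addresses) * access_bytes
    fetched = len(bursts) * burst_bytes
    return useful / fetched


# 64 sequential 4-byte reads: every byte of every burst is used.
sequential = burst_utilization(range(0, 256, 4), access_bytes=4)

# 64 reads of 4 bytes with a 256-byte stride (for example, walking down
# a column of an image): each read drags in a full burst.
strided = burst_utilization(range(0, 64 * 256, 256), access_bytes=4)

print(sequential)  # 1.0
print(strided)     # 0.0625
```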

It is this problem that caused GDDR to be created. “If you look at the graphics industry, they are a great example of CAS granularity,” adds Woo. “The application has a natural granularity of data that it wants, and giving it more than that is detrimental. It wastes resources, it destroys caching algorithms, etc. Graphics wants 32-byte accesses. A couple of times the industry has tried to make a graphics memory with 64-byte access granularity, and both times the future standards went back to 32-byte access granularity. With GDDR6 compared to GDDR5, instead of having all 32-bits of the DRAM interface dedicated to a single request, we split it up into two 16-bit interfaces and treat them almost like two separate DRAMs. Because each is now half the width, you can afford to dump out twice as many bits on those wires and still get the same column granularity. So it is 16 bits wide, but it is twice as deep, and that helps the design of the core. It more closely matches the DRAM desire to dump out more data per request on each wire, with the more natural granularity wanted by the application.”
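The arithmetic behind Woo's description can be checked in a few lines: access granularity is the interface width multiplied by the number of transfers per request. The article only says the GDDR6 channel is "half the width" and "twice as deep", so the burst lengths of 8 and 16 used here are assumptions.

```python
def access_granularity_bytes(interface_width_bits: int, burst_length: int) -> int:
    """Bytes delivered by one column access: width times burst length."""
    return interface_width_bits * burst_length // 8


# Assumed burst lengths: 8 for the 32-bit case, 16 for the 16-bit case.
wide_shallow = access_granularity_bytes(32, 8)    # 32 bytes
narrow_deep = access_granularity_bytes(16, 16)    # 32 bytes

# Doubling the burst without narrowing the interface doubles the granularity.
wide_deep = access_granularity_bytes(32, 16)      # 64 bytes

print(wide_shallow, narrow_deep, wide_deep)
```

Halving the width while doubling the depth keeps the 32-byte granularity that graphics wants, while doubling the depth alone would produce the 64-byte granularity the industry has twice backed away from.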

Is it possible that AI will have needs, such as channel width, that are specific enough for a new memory standard to eventually be optimized for it? “Mobile phones used to use DDR and then when volume got high enough, and the needs became different enough from the mainstream, that it warranted a new standard,” Woo says. “The same happened with graphics. Early graphics used DDR, and eventually there was enough volume and demand to develop a new type of memory. Something similar could happen for AI. The question is, as the needs for that community evolve, does HBM no longer meet those needs? And do you need something even more specific? The things that will determine that are the longevity of the demand for this type of use case. If there is money available and enough volume, that could motivate enough DRAM manufacturing capacity to build to a new standard. Historically, we have seen that the market will come in and build the new standard around it.”

Fig. 1: Common memory systems for AI applications. Source: Rambus

Driving memory faster
All of the memory interfaces reside under the standards umbrella of JEDEC, an organization that has existed since 1958. As previously noted, JEDEC is actively advancing all of the memory standards. “There are four current and next-generation standards that are all under active development,” explains Greenberg. “GDDR6 is out and we expect faster GDDR6 parts over time. HBM2E has been out for a short period of time and there was an announcement in September about a higher speed grade of device that is in excess of the JEDEC standard. We have seen announcements from memory vendors about their plans for DDR5. We are in the early days of that standard, so we can expect that to get faster over time. And the LPDDR5 standard has the potential for a mid-life extension to the frequency range of that technology.”

Signal integrity (SI) and power integrity (PI) are often the limiting factors for how fast the interface can operate. “DDR is architected for module-based memory systems that are found in servers and PCs,” says Woo. “That bus topology is not as clean, from a signal integrity perspective, where you have to go through these connectors and discontinuities. That is a reason why it is harder to scale the speeds so quickly. But if you take a look at HBM or GDDR, where it is soldered straight to a board, it is a much cleaner interface and doesn’t have to go through a connector. That makes it a little easier from a signal integrity standpoint, at least up until now, to ramp the speed. The physical implementation is cleaner.”

Still, a lot of analysis has to be performed. “SI simulations are used to model the channel, along with the power delivery network, in order to ensure the fidelity of the signals from the drivers to the receiver is actually maintained,” says Srinivasan. “You also need to consider a significant amount of coupling or loss because of power noise or the coupling between various interconnects. One of the biggest challenges is the simulation capacity. For HBM you are looking at a 128-bit channel for each stack. You have to simulate the entire signal traces and this traverses from one die to another using through silicon vias (TSVs) down to the interposer traces, across to the parent logic die, along with all of the power delivery network.”

The standards also build in advanced capabilities to ensure reliable communications. “The intent of every standard is to offer a higher speed and at a lower I/O voltage,” says Sankaranarayanan. “We do not want to lose sight of power. So you are increasing the speed and bringing down the voltage, and that is accomplished through electrical and architectural features. From the electrical point of view, a new feature in LPDDR5 is decision feedback equalization (DFE). What this does is to open up the eye for the right data. As the data is sent by the PHY, the DRAM captures that, and the DFE would be on the front end of that and would open up the margin of the eye. So as the sampler reads the data, you have a higher probability of capturing it correctly. It is quite common for the controller PHY to have DFE for the read data. As the speeds increase, these measures taken on the channel allow us to operate with higher reliability.”

For many applications, the processor/memory bottleneck is here to stay, although it will improve at times and get worse at others as standards evolve. High-volume applications do have the ability to introduce new memory architectures and interfaces, as has been witnessed a couple of times already for mobile and graphics. An interesting question for the future is whether AI/ML, or new non-volatile memories, will bring about new memory standards.



Tanj Bennett says:

One problem we see with these futures is inadequate error correction. At present the only scheme with a good enough ECC design for reliable computation is DDR4 in a x4 configuration with 2 spare chips for SDDC/Chipkill. LPDDRx has just single-bit correction and weak integrity, which works OK for consumer devices with low chip counts but does not scale up for servers, which will have hundreds of DRAM chips per CPU socket. All of the wide chips (x8, LPDDRx, GDDRx, HBM), meaning everything that delivers more than 4 bits per chip, lack an ECC solution.

To paraphrase Dijkstra, who was talking about algorithms, the fastest solution first has to be a correct solution. I agree with this article that we are on the verge of moving to much closer, faster coupling of memory to CPU. But the error rates and need for correction have been a blind spot, which needs fixing before this can truly happen.

Brian Bailey says:

Thanks for the comment Tanj. It is always difficult to decide what to put into an article and what may make it too long. Here is an additional comment made by Synopsys’ Sankaranarayanan that did not make it into the article – and it may provide some hope for you:

In DDR4, the ECC gets generated by the controller and gets sent and stored in separate DRAMs and then as the data is read, the ECC is also read and then the controller would ascertain if everything was correct. If there is a single bit error, the controller can correct and send the correct data to the SoC. Now in DDR5 it understands the importance of transient errors, and the higher probability of transient errors that can happen in the memory array because the array itself is getting denser. So, there are some additional spares that have been introduced to store the ECC bits for every write data so that inside the DRAM, as the DRAM gets read out, the ECC can correct it there and send the corrected data to the controller.

Tanj Bennett says:

Unfortunately that is not enough. Single-bit errors, if left like that, can eventually combine with another error and become uncorrectable (see field studies of DRAM errors). The DDR5 approach makes things worse: it does not really keep enough information to allow the CPU to locate and avoid using that memory (planned fault avoidance), and more importantly, it creates a 136-bit structure inside the DRAM. A significant fraction of DRAM errors are faults in the column and row structures (which are inherently at the highest resolution used). The 136-bit structure, all activated at the same time, becomes a 136-bit error domain. DDR5 patches over the easy errors and in so doing makes the difficult errors worse.
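One way to see where a 136-bit structure could come from is the standard Hamming bound for a single-error-correcting code: r check bits can protect m data bits when 2^r >= m + r + 1. The sketch below assumes the 136 bits split into 128 data bits plus 8 check bits. That split is an assumption consistent with the figure above, not something stated in this thread.

```python
def min_sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


data_bits = 128  # assumed data width protected by the on-die code
check_bits = min_sec_check_bits(data_bits)

print(check_bits)              # 8
print(data_bits + check_bits)  # 136
```

A code of this size can correct one flipped bit per 136-bit word, which is why a fault that damages several bits in the same word falls outside what it can repair.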

With DDR4 and DDR5 in reliable computers this is solved by using 4-bit wide chips and 2 redundant chips, so the fault in any one chip (no matter how many bits) can be corrected. You cannot do this with wide chips like LPDDRx or HBMx.

While hard data on this is not released, it is probable that the multi-bit errors occur around 10 FIT (failures in time, per billion hours) per DRAM chip. For consumer products this is generally acceptable. Your cellphone likely has 4 DRAM chips and your laptop maybe 8. The CPU could have FIT rates of 200 to 300 so the DRAM is not your largest problem. A FIT rate of 114 is about 1 failure per year per thousand devices, actually not bad for consumer gear.

However, a server is likely to have around 320 DRAM chips per CPU chip, so in that case 10 FIT is clearly bad news: it would be one of the largest failure rates in the machine. This is why servers use 4-bit-wide DRAM with 2 spares.
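The FIT arithmetic above can be reproduced with a short sketch. It uses the figures quoted in this thread (114 FIT for a consumer device, 10 FIT per DRAM chip, 320 chips per server CPU) and assumes that per-chip FIT rates simply add, which is a simplification.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def failures_per_year(fit: float, devices: int = 1) -> float:
    """Expected failures per year for `devices` units at a given FIT rate.

    FIT is failures per billion (10^9) device-hours.
    """
    return fit * 1e-9 * HOURS_PER_YEAR * devices


# Consumer device at 114 FIT: about 1 failure per year per thousand devices.
consumer_per_thousand = failures_per_year(114, devices=1000)

# Server: 320 DRAM chips at 10 FIT each, assuming the rates add.
server_dram_fit = 320 * 10
server_per_thousand = failures_per_year(server_dram_fit, devices=1000)

print(f"{consumer_per_thousand:.2f} failures/year per 1000 consumer devices")
print(f"{server_per_thousand:.1f} DRAM failures/year per 1000 server CPUs")
```

Under these assumptions the uncorrected multi-bit error rate for server DRAM comes out at about 28 failures per year per thousand CPUs, compared with about one per thousand for the consumer device.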

The problem is, as you say, the architectural demand for faster memory is also a demand for wider memory, and it is some of those high end servers where the high performance is interesting. But if you lose the use of chipkill now you are back to staring down those nasty multi-bit errors in the fine structure of the chips. Solving that is going to be necessary.

Brian Bailey says:

Thanks Tanj – I really appreciate the feedback and this additional information. It is always hard knowing when you have the full story and not just the parts that vendors want to be heard.
