Using Memory Differently To Boost Speed

Getting data in and out of memory faster is adding some unexpected challenges.


Boosting memory performance to handle a rising flood of data is driving chipmakers to explore new memory types and different ways of using existing memory, but it also is creating some complex new challenges.

For most of the semiconductor design industry, memory has been a non-issue for the past couple of decades. The main concerns were price and size, but memory makers have been more than able to keep up with processing demands. That is beginning to change for several reasons:

  • The overall quantity of data being generated is skyrocketing, largely due to more connected devices with sensors. Because there is too much data to ship to the cloud, it will have to be processed at the edge or even at the end point, where power is a key factor.
  • There is a bottleneck in the traditional memory/storage hierarchy. Disk drives are cheap, but their data transfer rates are slow. Solid-state drives (SSDs) are faster, but there are still some latency issues.
  • Memory scaling is experiencing the same kinds of issues as logic scaling, where increased density is making it harder to move data in and out of devices more quickly. This is why HBM2 DRAM is becoming more popular in high-performance devices, and why there is so much money pouring into the development of MRAM, ReRAM, FeRAM, and phase change memory.
  • AI and machine learning systems are driving new architectures that rely on massive data throughput to achieve orders of magnitude improvements in performance. Some of these architectures involve in-memory and near-memory computing, but they also are increasingly focused on reading and writing larger strings of data.

To deal with these changes, chipmakers and data scientists are beginning to rethink the fundamentals of the von Neumann architecture, and that is raising a number of issues that are new for most design teams.

Rounding errors
One approach to improving performance is to prioritize multiply/accumulate operations, rather than just distributing data randomly around a memory device.

“As you cycle through the memory, you are essentially opening a drawer and picking out one thing,” said Steven Woo, fellow and distinguished inventor at Rambus. “But it takes a lot of energy to open a drawer and pick out something that’s useful and then close that drawer. It’s like opening a bank vault, retrieving one item, and closing it back up. The overhead to open the vault is high. You’d really like to retrieve multiple items so you amortize the cost of opening and closing the vault. The energy efficiency of AI systems is extremely important, so you want to arrange your models and training data in a way that you can retrieve large amounts of data each time you need to go to memory — in effect, amortizing the high energy required to open and close the vault door.”
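
Woo's vault analogy comes down to simple arithmetic: a fixed cost to open and close, shared across however many items are retrieved. The sketch below illustrates that relationship only. The function name and the unitless cost figures are hypothetical and are not measurements from any real memory device.

```python
def energy_per_item(open_close_cost: float, read_cost: float, items_per_access: int) -> float:
    """Average energy per retrieved item when one open/close is shared by several reads."""
    if items_per_access < 1:
        raise ValueError("items_per_access must be at least 1")
    return (open_close_cost + read_cost * items_per_access) / items_per_access


# Hypothetical, unitless costs chosen only to show the trend.
OPEN_CLOSE_COST = 100.0
READ_COST = 1.0

for items in (1, 8, 64):
    print(items, energy_per_item(OPEN_CLOSE_COST, READ_COST, items))
# 1 101.0
# 8 13.5
# 64 2.5625
```

With these assumed numbers, the fixed overhead dominates a single-item access and becomes a small part of the total once many items are retrieved per access.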

Another way to deal with this is to utilize fewer bits. “8-bit floats require less storage than 16-bit floats, so they take less energy to read from memory,” said Woo. “The downside is that they are also less precise. With modern neural networks performing billions of multiply-accumulate (MAC) operations, or more, the lack of precision and the way rounding is done can affect your results.”

All of that has to be taken into account with AI and machine learning systems. In fact, there is a whole body of research in computer science about how to round numbers to preserve accuracy.

“Imagine you had only one bit of fractional precision,” said Woo. “So you can only represent whole numbers or half numbers. For example, you can only represent 1.0, 1.5, 2.0, 2.5, etc. If your calculations say you need a value of 0.2, you can’t encode that value because you lack the numerical precision to do so. Rounding the number down to 0 may mean that nothing will change. And rounding up to 0.5 may mean that you overshoot how you adjust your network. There’s been some interesting work combining mixed-precision numbers to improve accuracy, as well as in alternative rounding methods like stochastic rounding. With stochastic rounding, sometimes you round up and sometimes you round down, so that on average you achieve the number you’re looking for. In the first example, for a desired value of 0.2, then 60% of the time you would round down to 0 and 40% of the time you would round up to 0.5. On average, the value will be 0.2.”
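
Woo's example can be tried out in a few lines. The following is a minimal sketch, assuming a number format with steps of 0.5 as in his description. The function names and the fixed random seed are illustrative choices, not part of any particular framework.

```python
import math
import random


def round_to_nearest(value: float, step: float = 0.5) -> float:
    """Round to the nearest multiple of `step` (ties round up)."""
    return math.floor(value / step + 0.5) * step


def round_stochastic(value: float, rng: random.Random, step: float = 0.5) -> float:
    """Round up or down at random so the expected result equals `value`."""
    lower = math.floor(value / step) * step
    # Fraction of the way from `lower` to the next representable value.
    fraction = (value - lower) / step
    return lower + step if rng.random() < fraction else lower


rng = random.Random(0)  # fixed seed (arbitrary) so runs are repeatable
samples = [round_stochastic(0.2, rng) for _ in range(100_000)]
average = sum(samples) / len(samples)

print(round_to_nearest(0.2))  # 0.0: the small update is lost every time
print(round(average, 2))      # close to 0.2 on average
```

For a desired value of 0.2, `fraction` is 0.4, so the sketch rounds up to 0.5 about 40% of the time and down to 0 about 60% of the time, matching the split Woo describes.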

Different methods also can be combined, giving researchers a number of options for designing algorithms that trade off energy efficiency and accuracy.

The impact of AI
AI adds some unique twists into all of this.

“Another problem with AI’s usage of memory, which is quite unique, is the concept of sparsity,” said Carlos Maciàn, senior director of AI strategy and products at eSilicon. “In other domains, such as networking, memory is efficiently used, both in reading and writing, by packing data tightly, where every bit carries some information. The key word is dense. In AI, on the other hand, which is the art of approximate computing, the process of training consists of discerning the relative importance of every branch of the network model—the so-called weights.”

And this is where it gets especially interesting, because the accuracy of that data has a direct impact on the functionality of memory.

“Many of those weights end up being zero, or very close to it, so as to become irrelevant to the final result,” Maciàn said. “As a result, large portions of the network model can be ignored and conform a very sparse graph, with lots of zeroes being stored in memory. While there are approaches to compacting sparse graphs, it is also very useful to be able to identify if all weights stored at a given memory position are zero. By doing that, you can avoid performing any operation involving those weights, and so save a tremendous amount of power. eSilicon’s WAZPS memory feature does exactly that.”
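
The general idea of skipping work when a whole group of weights is zero can be sketched in software. This is not eSilicon's WAZPS implementation, which is a hardware memory feature whose details are not described here. The function names, the block size of four, and the sample data below are all assumptions made for illustration.

```python
def block_is_all_zero(block) -> bool:
    """True when every weight in the block is exactly zero."""
    return all(weight == 0 for weight in block)


def sparse_aware_dot(weights, activations, block_size: int = 4):
    """Dot product that skips every block whose weights are all zero.

    Returns (result, number_of_blocks_skipped).
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    total = 0.0
    skipped = 0
    for start in range(0, len(weights), block_size):
        w_block = weights[start:start + block_size]
        if block_is_all_zero(w_block):
            skipped += 1  # no multiply-accumulate work for this block
            continue
        a_block = activations[start:start + block_size]
        total += sum(w * a for w, a in zip(w_block, a_block))
    return total, skipped


# Hypothetical data: the first and last blocks hold only zero weights.
weights = [0, 0, 0, 0, 0.5, 0, -1.0, 0, 0, 0, 0, 0]
activations = [1.0, 2.0, 3.0, 4.0, 2.0, 1.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0]

result, skipped = sparse_aware_dot(weights, activations)
print(result, skipped)  # -2.0 2
```

In this sketch the zero check still reads each weight. The power savings Maciàn describes depend on the memory itself reporting that a location holds only zeroes, so the operands never need to be fetched or multiplied.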

Analog memory
Another new approach involves how data is captured and stored. Because much of the data being collected is analog, it is inherently inefficient to digitize that data, both from a time and energy perspective. Using the human brain as the model, neuromorphic computing aims to improve the power and performance efficiency of computing.

IBM researchers began investigating this area last year, looking at the possibility of using ReRAM and phase-change memory, as well as photonics, to lower the amount of energy needed to move data. They concluded that mixed-signal chips in memcomputing would be a big step forward.

Others have similar viewpoints. “ReRAM has the best characteristics for storing analog values,” said Gideon Intrater, CTO at Adesto. “Today’s storage devices are designed to store just ones and zeroes. What you want to do is to be able to store analog values with a number of bits of accuracy—maybe 6 to 12 bits of accuracy. This has been done in labs, but not anything close to scaling it to the large sizes of matrix operations that we need, or which would bring it to a level that it can be productized in a repeatable fashion.”

This is a non-trivial effort, because it effectively means adding a level of abstraction to the way memories work.

“Taking this kind of memory and moving it to an analog device is quite an undertaking,” Intrater said. “Hopefully the industry in a number of years will see products where we can store data in an analog fashion. With the analog world, you have to deal with all the problems of the digital world plus the analog world. So density is a big issue. You need to store tens or hundreds of megabits on chip, and maybe even more. In addition, you need to have the capability of storing an analog value in the device and actually retrieving that analog value in a solid fashion.”

There has been much talk over the years about replacing DRAM and SRAM with a universal memory that has the speed of SRAM and the price, performance and endurance of DRAM. That isn’t likely to happen anytime soon. Both will remain staples of chip design for the foreseeable future.

That doesn’t mean the older memory technologies are standing still, however. New flavors of DRAM already are in use—GDDR6 and HBM2—with more on the way. Alongside all of this, the pipelines to move data in and out of memory are speeding up.

“We’re seeing changes to the interfaces to DRAMs, MRAM, and flash,” said Graham Allan, senior marketing manager for DDR PHYs at Synopsys. “It’s more efficient to effectively channelize the interface. This is one of the reasons LPDDR4, LPDDR4x and LPDDR5, for example, all go to 16-bit channels. It also works with the internal architecture of the DRAM so there are no bubbles in the data stream when you’re going from read to write. And DDR5 has gone to dual 40-bit. If you have a chip with eight 32-bit LPDDR4x interfaces on it, that’s effectively 16 channels. Each channel can do its own thing, some reading, some writing. This is extremely efficient in how they transfer data because you’re not using the whole interface for any one specific purpose at any point in time.”
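
The channel count in Allan's example follows from dividing each interface into fixed-width channels. A small sketch of that arithmetic, using a hypothetical helper name and only the widths mentioned in the quote:

```python
def channel_count(num_interfaces: int, interface_width_bits: int, channel_width_bits: int) -> int:
    """Number of independent channels when each interface is split into equal-width channels."""
    if interface_width_bits % channel_width_bits != 0:
        raise ValueError("interface width must be a multiple of the channel width")
    return num_interfaces * (interface_width_bits // channel_width_bits)


# Eight 32-bit LPDDR4x interfaces, each split into 16-bit channels.
print(channel_count(8, 32, 16))  # 16
```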

Adding channels is a fairly straightforward way of improving performance for existing memory types.

“LPDDR5 is natively a two-channel device, but it’s very common to have more than two channels,” said Marc Greenberg, group director for product marketing at Cadence. “So the question then becomes how you map traffic from a particular channel into the core of the device. People do it with on-chip networks, but that’s a big architectural modeling problem. What cores do you want accessing what channels, and do you want every core to access every channel, and what are the implications of that? What happens if you have data in the channels that the cores can’t access, and how do you transfer that data around? Those are big architectural problems people will spend a lot of time looking at. It is a solvable problem, and everybody has to solve it because you can’t make a chip until you decide where the data is going to go. But there’s still a lot of room to run with people figuring out the best way to do that.”

Data traditionally has been randomly distributed in memory, but it’s also possible to statistically distribute data to reduce write-to-read times. There also potentially are ways to compress the data using encryption, which is still in the early stages of development.

“There are startups that are trying to encrypt data, and one of the benefits is that when you’re encrypting it you might only have to transfer 96 bits of data instead of 128 bits,” said Synopsys’ Allan. “You effectively have a higher bandwidth. But you do have a latency overhead. The more security you have, the earlier you want to encrypt that data. You want it going as far along the channel as you can, so there is a latency penalty for the encryption and decryption.”
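
The bandwidth side of that tradeoff is easy to quantify. The sketch below, with a hypothetical function name, computes only the gain from moving fewer bits over the same link. It does not model the latency penalty Allan mentions, which would have to be measured for a specific design.

```python
def effective_bandwidth_gain(payload_bits: int, transferred_bits: int) -> float:
    """Ratio of data delivered to bits actually moved across the link."""
    if transferred_bits <= 0:
        raise ValueError("transferred_bits must be positive")
    return payload_bits / transferred_bits


# 128 bits of data delivered while only 96 bits cross the interface.
gain = effective_bandwidth_gain(128, 96)
print(f"{gain:.2f}x")  # 1.33x
```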

This is still faster than current encryption and decryption approaches, and it can improve overall system efficiency and performance.

Finally, matching components can optimize performance and improve efficiency.

“You want to optimize processing power and memory bandwidth,” said Frank Ferro, senior director of product management at Rambus. “Some of the big systems vendors have optimized around AI and general-purpose processes that don’t look that good, but for that particular application they’ve tuned the curve to maximize the throughput of the GPU versus the memory bandwidth. That’s what they’re tuning for. You want to optimize your application around what’s available.”

Density issues
Along with all of these approaches, memory continues to shrink at different rates. DRAM vendors have their own schedule for increasing density, but SRAM has to shrink with whatever process the chip is in. That is beginning to cause problems, because SRAM doesn’t shrink as well as other digital circuitry. The result is that the amount of space taken up by SRAM, which typically is used for cache, is growing.

“In the past, 40% of the chip was memory,” said Farzad Zarrinfar, managing director of the IP Division at Mentor, a Siemens Business. “Now it’s 60% to 70% of the chip, and in AI chips it can be 70% to 80%. Area is playing a major role in SRAM scaling, which is why we continue to increase the density.”

This is getting harder for different reasons, though. Leakage is rising as memory bit cells shrink and voltages are reduced. “Memory bits are continually dealing with ways to minimize leakage and lower the retention voltage,” said Zarrinfar. “The functional voltage is decreasing, and then we have to deal with it using HVT (high voltage threshold cells) to reduce leakage, UHVT to further reduce leakage, and LVT (low voltage threshold) to maximize speed.”

At the same time, there also are performance issues as density increases, so write-assist and read-assist techniques are used to optimize that density.

Storing and accessing data differently has been researched on and off for a number of years. Until the past couple of nodes, though, device scaling generated sufficient improvements in performance and power, so little progress was made in this area. But as the benefits of scaling continue to fall off, architectural changes are becoming increasingly critical for PPA improvements. This is particularly important as the amount of data being processed and stored continues to explode, and it has added a sense of urgency to memory research and development.

That also is driving research into new memory types and packaging, as well as better ways to prioritize and access data. While memory has always been an integral part of computing, it is getting a second look as chipmakers look to get big boosts in performance for less power, something they used to expect with scaling.


Kevin Cameron says:

At the end of the day in-memory computing is the only option – a volume of data only has a surface for communication, so trying to move the data in and out of storage leaves you a dimension out of luck. Aside from that, if you are looking at analog/mixed-signal approaches, it would be good to fix the simulation tools, I’ve been waiting for over two decades for someone to take an interest in that.

Gil Russell says:

Flash Memory Summit 2019 commences on Tuesday next week. Supposedly there will be information on “Computational Memory” that executes off DIMMs running Hyperdimensional Computing and such items that were only the realm of “Research” prior to this time. Be nice to see you there.
