Pressure is building to change the programming paradigm associated with memory, but so far the economics have not justified the disruption.
The relationship between a processor and its memory used to be quite simple, but in modern SoCs there are multiple heterogeneous processors and accelerators, each needing a different means of accessing memory for maximum efficiency.
Compromises are being made in order to preserve the unified programming model of the past, but the pressure for fundamental change is increasing. It doesn’t matter what segment of the market the chips are designed for. The end of Moore’s Law for many, the end of Dennard scaling for everyone, and the need to get the most out of the available transistors all mean that architectural choices are being questioned. Memory is being looked at closely.
Power is one driver. “Most silicon designs use various technologies for reducing power consumption,” says Anoop Saha, market development manager at Mentor, a Siemens Business. “Improving the memory accesses is one of the biggest bang-for-the-buck architecture innovations for reducing overall system-level power consumption. That is because an off-chip DRAM access consumes almost a thousand times more power than a 32-bit floating point multiply operation.”
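The ratio Saha cites makes the arithmetic stark. As a rough illustration (the absolute energy per multiply is a placeholder value, and the 1000:1 ratio is simply taken from the quote above), even a workload in which only 1% of operands spill to DRAM ends up spending roughly 90% of its energy on those accesses:

```c
/* Back-of-the-envelope energy split for a kernel that performs N 32-bit
 * FP multiplies and fetches a small fraction of its operands from
 * off-chip DRAM. The 1000:1 energy ratio comes from the quote above;
 * the absolute pJ value is an assumed placeholder. */
#include <stdio.h>

int main(void) {
    const double e_mul_pj  = 1.0;               /* assumed energy per FP32 multiply   */
    const double e_dram_pj = 1000.0 * e_mul_pj; /* ~1000x per the quote               */

    const double n_mul  = 1e9;                  /* one billion multiplies             */
    const double n_dram = 1e7;                  /* only 1% of operands come from DRAM */

    double e_compute = n_mul  * e_mul_pj;       /* in pJ */
    double e_memory  = n_dram * e_dram_pj;      /* in pJ */

    printf("compute: %.2f mJ, DRAM: %.2f mJ (%.0f%% of total)\n",
           e_compute * 1e-9, e_memory * 1e-9,
           100.0 * e_memory / (e_compute + e_memory));
    return 0;
}
```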
There are big implications to changing the memory architecture. “The challenge is that in the past, people had a nice, abstract model for thinking about computing systems,” says Steven Woo, fellow and distinguished inventor at Rambus. “They never really had to think about memory. It came along for free, and the programming model just made it such that when you did references to memory, it just happened. You never had to be explicit about what you were doing. What started to happen with Moore’s Law slowing, and when power scaling stopped, is that people started to realize there are a lot of new kinds of memories that could enter the equation, but to make them really useful you have to get rid of the very abstract view that we used to have.”
Even looking at memory in the traditional manner means constant change. “The problem has changed over time as memories get bigger and the interfaces to those memories get more bandwidth associated with them,” says Marc Greenberg, group director for product marketing in the Cadence IP Group. “The memory architectures we see today, with a memory like DDR5 or LPDDR5 or GDDR6, are different from what we would have seen with DDR1 20 years ago. But everyone is still trying to resolve the same problem, which is how can I get the right amount of memory capacity available to my processing element in a minimum amount of time and at reasonable cost.”
Rethinking needs
There are a few standard techniques the industry has relied on in the past, such as caching, to move data closer to the processor. “If the data is closer, it means it uses less power and it also means that you can run those interfaces faster,” says Rambus’ Woo. “Now that we may not have better logic on the cadence that we are used to, the challenge is to start thinking about different ways to get the performance gains. This is why people are focusing on the newer architectures, domain-specific architectures, and focusing more on efficient data movement — making sure it is close to where it is used.”
Domain specificity will be a big part of this. “Designers need to make choices in the context of the overall system and the application or workload,” says Mentor’s Saha. “Some of the microarchitecture choices they need to make are related to memory hierarchy, with a balance between latency and bandwidth – local vs shared vs global vs off-chip. It is also important to optimize the algorithm to improve data locality so as to minimize data movement. These choices are dependent on the specific workloads that the chip is designed to run. For example, image processing accelerators use line buffers (which work on only a small sample of an image at a time), whereas a neural network accelerator uses double buffer memories (as it will need to operate on the image multiple times).”
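A minimal sketch of the line-buffer pattern Saha mentions, with hypothetical DMA and filter helpers standing in for accelerator hardware (the double-buffer pattern appears later, alongside Tate’s weight-prefetch example). Only three image rows are ever resident on-chip:

```c
/* Line-buffer sketch for a 3x3 image filter: only three rows of the
 * image are held on-chip at any time. The width, dma_load_row() and
 * process_window() are hypothetical stand-ins for real hardware. */
#include <stdint.h>
#include <string.h>

#define W 640  /* assumed image width */

static void dma_load_row(uint8_t *dst, int row) {      /* stand-in for a DMA fetch */
    (void)row;
    memset(dst, 0, W);
}

static void process_window(const uint8_t rows[3][W]) { /* stand-in for the filter  */
    (void)rows;
}

void line_buffer_pass(int height) {
    static uint8_t rows[3][W];                 /* total on-chip storage: 3 rows */
    for (int r = 0; r < height; r++) {
        memmove(rows[0], rows[1], W);          /* slide the 3-row window down   */
        memmove(rows[1], rows[2], W);
        dma_load_row(rows[2], r);              /* fetch only the newest row     */
        if (r >= 2)
            process_window((const uint8_t (*)[W])rows);
    }
}
```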
We see similar micro-architectural choices being made with high-level synthesis tools. While they are creating custom logic solutions, the relationship between the logic and memory is often at the heart of the micro-architectural choices made during the synthesis process. “The ability to fine-tune the microarchitecture choices for these designs requires not only moving to a higher level of abstraction, but also doing frequent C-to-GDS flows to measure the PPA impact of the design changes,” adds Saha.
Domain-specific architectures often come with new languages and programming frameworks. “These often create new tiers of memory and ways to cache it or move the data so that it is closer to where it needs to be,” says Woo. “It adds a dimension that most of the industry is not used to. We are not really being taught that kind of thing in school, and it’s not something that the industry has decades of experience with. So it is not ingrained in the programmers.”
High-end systems are constantly looking at new interfaces and structures to help improve performance. “The biggest bottleneck reduction being brought to market today is associated with HBM memory,” says Brett Murdock, senior product marketing manager for Synopsys. “This is a slow but steady breakthrough that is helping the situation. Customers can use HBM memory to achieve high bandwidth, thus the name, and they can use it like a level 4 cache in the system. It gets a lot of data a lot closer to the processor. They will still use a DDR5 style of memory to do the bulk of the data storage, but this memory is extra cycles away.”
While it is easy to concentrate on the high-end systems, the same issues are being felt at the lowest levels, as well. Consider a small MCU working with flash memory. “The MCU has to do all of the work,” says Paul Hill, senior marketing director for Adesto’s standard products group. “The memory is treated like a child and has to be told everything. When you start a program cycle, the MCU has to send the data, start the program cycle, and then monitor the situation until the end of the program cycle. In addition, when you have to program data, you have to check the memory to make sure it is capable of accepting the data. If not, you have to erase a section and start the program cycle. Every time the MCU accesses the memory, it has to perform checks and tests and perform function control. Why can’t the memory device become a more integral part of the system and the memory device do a lot of this on its own, and why can’t the memory device become an intelligent peripheral?”
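A sketch of the babysitting Hill describes, for a generic SPI NOR flash. The opcodes shown are typical JEDEC-style values, but the spi_*() helpers and the exact command flow are placeholders rather than any specific device’s interface:

```c
/* The MCU must drive every step itself: enable, erase, poll, program, poll.
 * Opcodes are typical SPI NOR values; all spi_*() helpers are hypothetical. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define CMD_WREN   0x06   /* write enable             */
#define CMD_RDSR   0x05   /* read status register     */
#define CMD_SE     0x20   /* sector erase             */
#define CMD_PP     0x02   /* page program             */
#define SR_BUSY    0x01   /* write-in-progress bit    */

extern uint8_t spi_cmd_read8(uint8_t cmd);                        /* hypothetical */
extern void    spi_cmd(uint8_t cmd);                              /* hypothetical */
extern void    spi_cmd_addr(uint8_t cmd, uint32_t addr);          /* hypothetical */
extern void    spi_cmd_addr_data(uint8_t cmd, uint32_t addr,
                                 const uint8_t *data, size_t n);  /* hypothetical */

bool flash_update(uint32_t addr, const uint8_t *data, size_t n) {
    spi_cmd(CMD_WREN);
    spi_cmd_addr(CMD_SE, addr);                  /* erase before programming  */
    while (spi_cmd_read8(CMD_RDSR) & SR_BUSY)    /* busy-wait, possibly ms    */
        ;                                        /* MCU can do nothing else   */

    spi_cmd(CMD_WREN);
    spi_cmd_addr_data(CMD_PP, addr, data, n);    /* send the page data        */
    while (spi_cmd_read8(CMD_RDSR) & SR_BUSY)    /* poll again until complete */
        ;
    return true;
}
```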
New applications
It is always easier to change something when there are no entrenched solutions. This is what happened with graphics processing, where the compute paradigm was different enough that it was impossible to get the performance or productivity using the tools and flows designed for general-purpose CPUs. Change took time, but eventually it did happen.
Today, two other application areas are driving change. The first is artificial intelligence (AI) based on neural networks. “Most AI accelerators made significant efforts in reducing off-chip memory accesses,” says Saha. “For example, one of the major architectural changes in Google’s first TPU was a dramatic increase in on-chip SRAM. This avoided spilling activations to off-chip DRAM. Other ASICs have dramatically increased the memory bandwidth. Some companies are also looking at ‘in-memory compute’ as a complete do-over from the traditional von Neumann architecture.”
Geoff Tate, CEO of Flex Logix, explains the AI memory problem. “The key computational element in neural networks is the multiply/accumulate function (MAC). Each one requires accessing a weight value from memory (1 of 62 million in the case of YOLOv3), and YOLOv3 takes more than 300 billion MACs to compute a 2-megapixel image. The number of weights required by CNN object detection/recognition neural networks and other megapixel image processing algorithms is in the tens of millions. This is too much to store on-chip for any cost-effective chip suitable for edge computing, so weights must be stored in DRAM. The objective in designing the memory subsystem should be to minimize DRAM transfers, since they are much higher power per byte than on-chip accesses, and to store information as close to the MACs as possible.”
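To see why that objective matters, a rough calculation on the numbers Tate quotes (the INT8 weight width and frame rate are assumptions added here for illustration) shows the gulf between re-fetching weights for every MAC and fetching each weight once per frame:

```c
/* Rough arithmetic on the YOLOv3 figures quoted above. The weight width
 * (INT8) and frame rate are illustrative assumptions, not from the text. */
#include <stdio.h>

int main(void) {
    const double weights      = 62e6;    /* ~62 million weights (YOLOv3)     */
    const double macs_per_img = 300e9;   /* >300 billion MACs per 2 MP image */
    const double bytes_per_w  = 1.0;     /* assume INT8 weights              */
    const double fps          = 30.0;    /* assumed frame rate               */

    /* Footprint: why the weights will not fit in cost-effective on-chip SRAM. */
    printf("weight footprint: %.0f MB\n", weights * bytes_per_w / 1e6);

    /* Worst case: every MAC re-fetches its weight from DRAM. */
    printf("DRAM weight traffic, no reuse:   %.0f GB/s\n",
           macs_per_img * bytes_per_w * fps / 1e9);

    /* Best case: each weight crosses the DRAM interface once per frame. */
    printf("DRAM weight traffic, full reuse: %.2f GB/s\n",
           weights * bytes_per_w * fps / 1e9);
    return 0;
}
```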
While every AI chip is approaching this problem in a different way, domain specificity can be used. For example, with an AI accelerator that processes layer-by-layer, it is possible to anticipate what memory contents will be required ahead of time.
“While layer N is being processed, the weights for layer N+1 are brought in from DRAM, in the background, during computation of layer N,” says Tate. “So the DRAM transfer time rarely stalls compute, even with just a single DRAM. When layer N compute is done, the weights for layer N+1 are moved in a couple of microseconds from a cache memory to a memory that is directly adjacent to the MACs. When the next layer is computed, the weights used for every MAC are brought in from SRAM located directly adjacent to each cluster of MACs, so the computation access of weights is very low power and very fast.”
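This is the double-buffer pattern applied to weights. A hedged sketch of the overlap Tate describes, with all DMA and compute helpers hypothetical:

```c
/* Layer N+1's weights stream from DRAM into a cache buffer while layer N
 * computes, then hop into MAC-adjacent SRAM just before they are used.
 * All types and helper functions here are hypothetical. */
typedef struct dma_xfer dma_xfer_t;

extern dma_xfer_t *dram_fetch_weights_async(int layer, void *cache_buf); /* hypothetical */
extern void        dma_wait(dma_xfer_t *xfer);                           /* hypothetical */
extern void        copy_cache_to_mac_sram(const void *cache_buf);        /* hypothetical, a few us */
extern void        compute_layer(int layer);                             /* hypothetical */

void run_network(int num_layers, void *cache_buf[2]) {
    dma_xfer_t *pending = dram_fetch_weights_async(0, cache_buf[0]);

    for (int n = 0; n < num_layers; n++) {
        dma_wait(pending);                        /* rarely stalls: fetch was overlapped */
        copy_cache_to_mac_sram(cache_buf[n & 1]); /* microseconds, not DRAM latency      */

        if (n + 1 < num_layers)                   /* kick off layer N+1's weights now    */
            pending = dram_fetch_weights_async(n + 1, cache_buf[(n + 1) & 1]);

        compute_layer(n);                         /* weights read from adjacent SRAM     */
    }
}
```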
Another application area that is changing rapidly is the Internet of Things (IoT). Just a few years ago, large industry players suggested that there would be billions of sensors, each sending data to the cloud for processing.
“A lot of people talk about edge computing being an answer to the fact that when you have all of these connected devices, it is just impossible to move all of that data to the cloud,” says Rambus’ Woo. “So this is a great example of where processing is moving closer to the data. There is an interesting security aspect to this, as well. Processing the data next to where the data is generated leaves open fewer security holes. There can still be security concerns with the devices, but on top of that, it is a distributed processing problem.”
Many IoT systems are based on MCUs, and most utilize flash memory. Flash memory has long had to contain processing capability to deal with issues such as wear leveling, but it has always attempted to maintain the programming paradigm of other memory types. “Using a few simple techniques, we can demonstrate 5X faster performance using up to 70% less power,” says Adesto’s Hill. “A memory device consuming 5 or 6 mA is insignificant when the MCU is consuming an order of magnitude more than that. So we look at how the memory works with the system and ask if we can reduce the power of the MCU itself. These changes modify the way in which the MCU works with the memory.”
Hill points to one way this can be done. “On the left (Figure 1) is the traditional serial flash. The orange block is only used for programming data. You write the data to the buffer, and as soon as you deactivate the chip select, the buffer starts to program the flash array. We made a simple change in that we made the SRAM buffer bidirectional. This has a large number of benefits. Another change is that typically, the erase block is 4KB, 32KB or 64KB. If you want to modify just one byte of data, anywhere in a serial flash, you have to ensure the entire block is erased first. Erasing that block can take up to 60 ms, which in an MCU-based system is an eternity. During that 60 ms, you cannot do anything else with that memory device. It is effectively offline. We changed the architecture by shrinking the erase granularity down to 256 bytes. This is what we call a page, and we made the device page-erasable. Simply, if you want to erase and change one byte, it makes sense to do that at the 256-byte level.”
Fig. 1: Architectural changes made in a flash memory device. Source: Adesto
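The effect of the smaller erase unit is easiest to see in a read-modify-write of a single byte. The sketch below uses hypothetical flash_read/erase/program helpers; the 4 KB and 256-byte sizes mirror the description above:

```c
/* Why erase granularity matters for a 1-byte update. The flash_*()
 * helpers are hypothetical; sizes mirror the article's description. */
#include <stdint.h>

#define BLOCK_SIZE 4096u    /* smallest erase unit on a traditional part */
#define PAGE_SIZE   256u    /* page-erasable device described above      */

extern void flash_read(uint32_t addr, uint8_t *buf, uint32_t n);          /* hypothetical */
extern void flash_erase(uint32_t addr, uint32_t n);                       /* hypothetical, ms-scale */
extern void flash_program(uint32_t addr, const uint8_t *buf, uint32_t n); /* hypothetical */

/* Traditional flow: to change 1 byte, save, erase and rewrite 4 KB. */
void update_byte_block(uint32_t addr, uint8_t value) {
    static uint8_t buf[BLOCK_SIZE];
    uint32_t base = addr & ~(BLOCK_SIZE - 1u);
    flash_read(base, buf, BLOCK_SIZE);
    buf[addr - base] = value;
    flash_erase(base, BLOCK_SIZE);          /* device offline for tens of ms */
    flash_program(base, buf, BLOCK_SIZE);
}

/* Page-erasable flow: only 256 bytes are read back, erased and rewritten. */
void update_byte_page(uint32_t addr, uint8_t value) {
    static uint8_t buf[PAGE_SIZE];
    uint32_t base = addr & ~(PAGE_SIZE - 1u);
    flash_read(base, buf, PAGE_SIZE);
    buf[addr - base] = value;
    flash_erase(base, PAGE_SIZE);           /* far shorter erase window */
    flash_program(base, buf, PAGE_SIZE);
}
```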
Another change Adesto made was to allow a degree of autonomy within the memory. Instead of having the processor wait for the completion of an operation, the memory will send an interrupt when it is ready. The processor can thus go to sleep and save power while this operation is being done.
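From the firmware side, that autonomy might look like the sketch below, with hypothetical driver calls: the MCU kicks off the operation and sleeps until the memory raises its ready interrupt, instead of burning power polling a status register.

```c
/* Interrupt-driven completion instead of status polling. The
 * flash_start_program(), mcu_sleep() and IRQ wiring are hypothetical. */
#include <stdint.h>
#include <stdbool.h>

extern void flash_start_program(uint32_t addr, const uint8_t *d, uint32_t n); /* hypothetical */
extern void mcu_sleep(void);                                                  /* hypothetical low-power wait */

static volatile bool flash_done;

/* Fired by the memory device's ready/interrupt pin when it finishes. */
void flash_ready_irq_handler(void) {
    flash_done = true;
}

void flash_write_lowpower(uint32_t addr, const uint8_t *d, uint32_t n) {
    flash_done = false;
    flash_start_program(addr, d, n);   /* memory handles the rest itself */
    while (!flash_done)
        mcu_sleep();                   /* MCU sleeps instead of polling  */
}
```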
We see other memory pressures being brought by the IoT. “One thing the industry has not done particularly well over the past five years is to build memories that are small enough for usage in applications like the IoT,” says Cadence’s Greenberg. “They would like to build a chip that is low power but utilizes some amount of DRAM. What are their options? It will probably not be LPDDR5. You may have to look at older memory technologies like LPDDR3, or even earlier, to be able to find a memory device that is small enough to fit that application. IoT is a special case that is at the opposite end of the spectrum from where servers, smartphones and AI live. They want more memory than can be put on the die economically in SRAM or eDRAM, but less than what is demanded by high-end applications.”
New approaches
There are several moving parts — the memory itself, the interface to the memory, and the paradigm of operation. Each of these is being looked at individually and collectively based on needs. However, the biggest gains are likely to be achieved when both the hardware and software change.
“Programmers will have to become more aware of what the memory hierarchy looks like,” says Woo. “It is still early days, and the industry has not settled on a particular kind of model. But there is a general understanding that in order to make it useful, you do have to increase the understanding about what is under the hood. Some of the programming models, like persistent memory (PMEM), call on the user to understand where data is — and to think about how to move it and ensure that the data is in the place that it needs to be.”
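As an example of what “understanding where data is” means in practice, here is a minimal persistent-memory sketch assuming PMDK’s libpmem is available; the file path and region size are placeholders:

```c
/* Minimal persistent-memory sketch using PMDK's libpmem: map a file on a
 * DAX filesystem, write to it as ordinary memory, then explicitly make
 * the data durable. Path and size are placeholders for illustration. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t mapped_len;
    int is_pmem;

    /* Map (and create if needed) a 4 KB persistent region. */
    char *buf = pmem_map_file("/mnt/pmem/example", 4096,
                              PMEM_FILE_CREATE, 0666,
                              &mapped_len, &is_pmem);
    if (buf == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(buf, "data placed explicitly in persistent memory");

    /* The programmer, not the cache hierarchy, decides when it is durable. */
    if (is_pmem)
        pmem_persist(buf, mapped_len);      /* cache-flush based path */
    else
        pmem_msync(buf, mapped_len);        /* fall back to msync     */

    pmem_unmap(buf, mapped_len);
    return 0;
}
```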
Notions about hardware/software co-design scare many in the industry. “The industry is starting to realize that hardware on its own cannot be developed without input from the software team,” says Hill. “Most software consumes all available resources, and developers have relied upon a steady increase in processing power. When they come up against a real-time constraint, they tend to throw more horsepower at the problem. But the memory device can no longer be treated as an isolated hardware component. To get the best possible usage of the system, the software engineer must become an integral part of the design and development cycle. These features may be implemented in hardware, but unless the software engineer makes use of them, they will bring no benefit to the system.”
Healthy skepticism remains. “For this utopian vision to occur, the CPU manufacturer, the OS provider, the memory manufacturer, and the application designer all have to co-design their system to take advantage of true differences in architecture,” says Greenberg. “We haven’t seen that except in a few closed systems where a manufacturer might control both the hardware and the operating system. But, as an industry, we are not ready or able to live in a world where the hardware is co-designed with the OS, the applications are co-designed with the OS, the memory is co-designed with the hardware, with everything done together. So we segment it, and each of those links is designed to work well independently of the others. We end up with a system that works acceptably well, but may leave some potential optimization behind. The prospect of being able to do compute in memory is ever-present, but we have never really been at a place where it makes economic sense to construct a true purpose-built DRAM for a specific application.”
Still, nobody is willing to say it will not happen at some point. “We continue to watch the space because it does have an impact on the memory hierarchy,” says Woo. “It has been a long time since new memories and new memory hierarchies have come to market, and it is getting to the point that the industry realizes that something like this could be very useful.”
Comments

AI/neural networks and in-memory computing are old problems, or at least they look like problems we have tackled before. Neural-network evaluation is much the same as circuit simulation, and in-memory computing is what Inmos was doing back in 1990.
And PMEM isn’t new either; it’s the same idea as HDDs (it’s just solid state now).
“Every time the MCU accesses the memory, it has to perform checks and tests and perform function control. Why can’t the memory device become a more integral part of the system and the memory device do a lot of this on its own, and why can’t the memory device become an intelligent peripheral?”
Thanks for your thoughts on this vital topic. Regarding your statement above, this is exactly the problem that the Gen-Z “semantic” model is designed to address. Hopefully we’ll see a robust adoption curve on this new paradigm over the next few years.
Another great article, Brian. Shared memory vs. message passing was a raging debate when I was in school. I was a bit disappointed that shared memory was generally viewed as an easier programming model. (Then I ended up spending years designing hardware to deal with the complexities of implementing shared memory.) This was an off-chip question then, and as we scale, as with everything in computer architecture, it’s now an on-chip issue. Nice summary of where we stand today.
While the problems may have been seen before, they have not been solved successfully. For a problem to count as solved, the solution has to be deployable by more than one person. In many cases, the software side of things has never been properly addressed, and the disruption levels are seen as being too high. That changes as the risk/reward levels change.