The Future Of Memory

From attempts to resolve thermal and power issues to the roles of CXL and UCIe, the future holds a number of opportunities for memory.


Experts at the Table: Semiconductor Engineering sat down to talk about the impact of off-chip memory on power and heat, and what can be done to optimize performance, with Frank Ferro, group director, product management at Cadence; Steven Woo, fellow and distinguished inventor at Rambus; Jongsin Yun, memory technologist at Siemens EDA; Randy White, memory solutions program manager at Keysight; and Frank Schirrmeister, vice president of solutions and business development at Arteris. Part 2 of this roundtable can be found here. What follows are excerpts of that conversation.

[L-R]: Frank Ferro, Cadence; Steven Woo, Rambus; Jongsin Yun, Siemens EDA; Randy White, Keysight; and Frank Schirrmeister, Arteris.

SE: How will CXL and UCIe play in the future of memory, especially given data transfer costs?

White: The primary goal for UCIe is interoperability, along with cost reduction and improved yield. So from the get-go, we’ll have better overall metrics through UCIe, and that will translate not just to memory, but to other IP blocks as well. On the CXL side, with a lot of different architectures coming out that are focused more on AI and machine learning, CXL will play a role in managing and minimizing costs. Total cost of ownership is always the top-level metric for JEDEC, with power and performance being secondary metrics that feed into it. CXL is basically optimized for disaggregated, heterogeneous compute architectures, reducing over-engineering, and designing around latency issues.

Schirrmeister: If you look at the networks-on-chip, like AXI or CHI, or OCP as well, those are the on-chip connection variants. However, when you go off-die or off-chip, PCIe and CXL are the protocols for those interfaces. CXL has various use models, including some understanding of coherency between the different components. At the Open Compute Project forum, when people talked about CXL, it was all about memory-attached use models. UCIe will always be one of the options for chip-to-chip connections. In the context of memory, UCIe may be used in a chiplet environment, where you have an initiator and a target that has attached memory. UCIe and its latency then play a big role in how all of that is connected, and how the architecture needs to be structured to get data in time. AI/ML architectures are very dependent on getting data in and out. And we haven’t figured out the memory wall yet, so you have to be architecturally smart about where you keep the data from a systemic perspective.
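As a rough illustration of why die-to-die latency shapes the architecture, the sketch below totals up the hops for a read served by locally attached memory versus one that crosses a UCIe link to memory attached to another chiplet. All of the component latencies are placeholder assumptions, not figures for any particular NoC, UCIe PHY, or DRAM device.

```python
# Illustrative latency-budget sketch for a read that crosses a chiplet
# boundary. Every value below is an assumed placeholder for illustration.

ON_CHIP_NOC_NS = 10   # initiator -> NoC -> die edge (assumed)
D2D_LINK_NS    = 4    # UCIe die-to-die crossing, one direction (assumed)
TARGET_NOC_NS  = 8    # NoC traversal on the memory-attached die (assumed)
DRAM_ACCESS_NS = 50   # attached DRAM access time (assumed)

def local_read_ns():
    # Read served by memory attached to the same die.
    return ON_CHIP_NOC_NS + DRAM_ACCESS_NS + ON_CHIP_NOC_NS

def remote_read_ns():
    # Read that must hop across the UCIe link and back.
    return (ON_CHIP_NOC_NS + D2D_LINK_NS + TARGET_NOC_NS
            + DRAM_ACCESS_NS
            + TARGET_NOC_NS + D2D_LINK_NS + ON_CHIP_NOC_NS)

print(f"local read : {local_read_ns()} ns")
print(f"remote read: {remote_read_ns()} ns")
```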

Woo: One of the challenges at the top level is that datasets are getting larger, so one of the issues CXL can help address is being able to add more memory to a node itself. The core counts of these processors are getting higher, and every one of those cores wants some amount of memory capacity to itself. On top of that, the datasets are getting bigger, so we need a lot more memory capacity per node. There is a plethora of usage models now. We’re seeing more usages where people are spreading data and computation among multiple nodes, especially in AI, with big models that are trained across lots of different processors. Protocols like CXL and UCIe provide pathways to help processors flexibly change the ways they access data. Both of those technologies will give programmers the flexibility to implement and access data sharing across multiple nodes in ways that make the most sense to them, and that address things like the memory wall, as well as power and latency issues.

Ferro: A lot has already been said about CXL from the memory pooling aspect. At a more practical cost level, because of the size of the servers and chassis in data centers, although you can stick more memory in there, it’s a cost burden. The ability to take that existing infrastructure and continue to expand out as you move into CXL 3.0 is important to avoid these stranded-memory scenarios, where you have processors that just can’t get to memory. CXL also adds another layer of memory, so now you don’t have to go out to storage/SSD, which minimizes latency. As for UCIe, with high-bandwidth memory and these very expensive 2.5D structures that are starting to come about, UCIe may be a way to help separate those and reduce the cost, as well. For example, if you’ve got a large processor — a GPU or CPU — and you want to bring memory very close to it, like high-bandwidth memory, you’re going to have to put that fairly big footprint on a silicon interposer or some interposer technology. That’s going to raise the cost of the whole system, because you’ve got to have a silicon interposer to accommodate the CPU, DRAM, and any other components you might want to have on there. With a chiplet, I can put just the memory on its own 2.5D assembly, keep the processor on a cheaper substrate, and then connect them through UCIe. That’s a use model that could be interesting for reducing that cost.
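The tradeoff Ferro outlines can be framed as a simple area-based cost comparison: one large silicon interposer carrying both processor and HBM, versus a small 2.5D assembly for the memory alone with the processor on an organic substrate and a UCIe link between them. The sketch below uses made-up relative cost and area figures purely to show the shape of the comparison, not real pricing.

```python
# Rough packaging-cost sketch. All values are illustrative assumptions.

INTERPOSER_COST_PER_MM2 = 0.10   # assumed relative cost of silicon interposer area
SUBSTRATE_COST_PER_MM2  = 0.01   # assumed relative cost of organic substrate area

CPU_AREA_MM2 = 800               # assumed processor footprint
HBM_AREA_MM2 = 110 * 4           # four assumed HBM stacks

def monolithic_2p5d():
    # CPU and HBM share one large silicon interposer.
    return (CPU_AREA_MM2 + HBM_AREA_MM2) * INTERPOSER_COST_PER_MM2

def split_chiplet():
    # Only the HBM sits on a small interposer; the CPU stays on substrate.
    memory_cost = HBM_AREA_MM2 * INTERPOSER_COST_PER_MM2
    cpu_cost    = CPU_AREA_MM2 * SUBSTRATE_COST_PER_MM2
    return memory_cost + cpu_cost

print(f"single large interposer: {monolithic_2p5d():.1f} (relative units)")
print(f"split chiplet approach : {split_chiplet():.1f} (relative units)")
```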

Yun: At IEDM, there was a significant amount of discussion about AI and different memories. AI models have been rapidly increasing their parameter counts, growing about 40 times larger in less than five years, so a tremendously large amount of data needs to be handled by AI. However, DRAM performance and board-level communication have not improved nearly as much, only about 1.5X to 2X every two years, which is far less than the demand coming from AI’s growth. This is one example of why we are trying to improve the communication between the memories and the chip. There is a substantial gap between the data supply from memory and the data demand from the computational power of AI, and it still needs to be resolved.
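Taking the figures Yun cites at face value, a quick back-of-the-envelope calculation shows how fast that gap widens. The growth rates below come straight from the discussion above and are treated as rough assumptions.

```python
# Compare ~40X AI parameter growth over five years against DRAM bandwidth
# improving roughly 1.5X-2X every two years over the same period.

years = 5

ai_per_year = 40 ** (1 / years)      # ~2.1X per year
dram_low    = 1.5 ** (years / 2)     # 1.5X every 2 years, compounded over 5 years
dram_high   = 2.0 ** (years / 2)     # 2X every 2 years, compounded over 5 years

print(f"AI demand growth over {years} years   : 40X (~{ai_per_year:.1f}X per year)")
print(f"DRAM bandwidth growth over {years} years: {dram_low:.1f}X to {dram_high:.1f}X")
print(f"Resulting gap after {years} years       : {40/dram_high:.0f}X to {40/dram_low:.0f}X")
```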

SE: How can memory help us solve power and thermal issues?

White: Power issues are memory’s problem. Fifty percent of the cost within a data center comes from memory, whether that’s the I/O itself or refresh management and cooling. We’re talking about volatile memory — DRAM specifically. As we’ve said, the amount of data out there is huge, the workloads are getting more intense, and speeds are getting faster, and that all translates to higher energy consumption. As we scale, there have been a number of initiatives to meet the bandwidth needed to support the increasing core count, and power scales accordingly. There are some tricks we’ve played along the way, including reducing the voltage swing on the I/O power rail, which improves I/O power as the square of the voltage. We’re trying to be more efficient about memory refresh management and using more bank groups, which also improves overall throughput. A few years ago, a customer came to us and wanted to propose a significant change within JEDEC to how memory is specified in terms of temperature range. LPDDR has a wider range and different temperature classifications, but for the most part we’re talking about commodity DDR, because that’s where the capacities are and it’s what’s most predominantly seen in the data center. This customer wanted to propose to JEDEC that if we could increase the operating temperature of DRAM by five degrees — even though we know the refresh rate would increase at the higher temperature — that in turn would reduce the power needed to support this growth by the equivalent of three coal power plants per year. So what’s done at the device level translates to a macro change on a global basis, on the level of power plants. In addition, there’s been over-provisioning in memory designs for quite a while at the architectural level. We came out with the PMIC (power management IC), so voltage regulation is done at the module level. We have on-board temperature sensors, so the system no longer needs to monitor temperature within the chassis. Now you have module- and device-specific temperature and thermal management to make it more efficient.
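The refresh-versus-temperature tradeoff White refers to can be illustrated with a simple overhead estimate: DRAM must be refreshed more often as it gets hotter, so raising the allowed operating temperature trades cooling savings for extra refresh overhead. The timing values below are typical DDR4-class numbers used as assumptions, not a statement of any vendor’s specification.

```python
# Minimal sketch of refresh overhead at normal versus extended temperature.
# All timing values are assumed, DDR4-class figures for illustration only.

T_REFI_NORMAL_US   = 7.8    # assumed average refresh interval at normal temperature
T_REFI_EXTENDED_US = 3.9    # assumed interval in the extended temperature range
T_RFC_NS           = 350    # assumed time the device is busy per refresh command

def refresh_overhead(trefi_us, trfc_ns=T_RFC_NS):
    """Fraction of time the DRAM is unavailable because it is refreshing."""
    return (trfc_ns / 1000.0) / trefi_us

print(f"refresh overhead, normal temperature  : {refresh_overhead(T_REFI_NORMAL_US):.1%}")
print(f"refresh overhead, extended temperature: {refresh_overhead(T_REFI_EXTENDED_US):.1%}")
```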

Schirrmeister: If DRAM were a person, it would definitely be socially challenged, because people don’t want to talk to it. Even though it’s very important, nobody wants to talk to it — or they want to talk to it as little as possible — because of the cost involved in both latency and power. In AI/ML architectures, for example, you want to avoid adding significant cost, and that’s why everybody is asking whether data can be stored locally or moved around in different ways. Can I arrange my architecture systemically so that the computing elements receive the data at the right time in the pipeline? That’s why it’s important. It has all the data. But then you want to optimize for power at the same time you optimize for latency. From a systemic perspective, you actually want to minimize the accesses. That has very interesting effects on the data transport architecture, the NoC, with people wanting to carry the data around, keeping it in various local caches, and basically designing their architecture to minimize, in that social sense, the accesses to the DRAM.

Ferro: As we look across different AI architectures, often the first goal is to keep as much data local as you can, or even avoid DRAM altogether. There are some companies putting that forward as their value proposition. You get orders-of-magnitude gains in power and performance if you don’t have to go off-chip. We’ve talked about the size of the data models. They’re getting so big and unwieldy that keeping everything on-chip is probably not practical. But the more you can do on-chip, the more you’re going to save on power. Even with HBM, the intent was to go very wide and very slow. If you look at the earlier generations of HBM, they had a DDR interface at data rates like 3.2Gb/s. Now they’re up into the 6Gb/s range, but that’s still relatively slow for a DRAM. It just goes very wide, and in this generation they even lowered the I/O voltage to 0.4V to try to keep that I/O power down. If you can run the DRAM slower, that’s going to save power at the same time. But now you’re taking memory and putting it very close to the processors, so you’ve got a bigger heat footprint in a smaller area. You’re improving some things, but making other things more challenging.
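The "wide and slow" argument follows from the usual CV²f scaling of dynamic switching power. The sketch below compares two hypothetical interfaces delivering the same aggregate bandwidth; the pin counts, swing voltages, and per-pin capacitance are illustrative assumptions only, not HBM or DDR specifications.

```python
# Simple CV^2f-style sketch of why a wide, slow, low-swing interface can
# move the same bandwidth for less I/O power. All parameters are assumed.

def io_power(pins, v_swing, gbps_per_pin, pf_per_pin=1.0):
    # Treat each pin as a capacitive load toggling at its data rate.
    c = pf_per_pin * 1e-12                # farads per pin (assumed)
    f = gbps_per_pin * 1e9                # toggles per second (rough)
    return pins * c * (v_swing ** 2) * f  # watts

# Same aggregate bandwidth (512 GB/s) reached two different ways:
narrow_fast = io_power(pins=64,  v_swing=1.1, gbps_per_pin=64)
wide_slow   = io_power(pins=512, v_swing=0.4, gbps_per_pin=8)

print(f"narrow/fast interface: {narrow_fast:.2f} W (illustrative)")
print(f"wide/slow interface  : {wide_slow:.2f} W (illustrative)")
```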

Schirrmeister: To build on Frank’s point, the NorthPole AI architecture from IBM is an interesting example. If you look at it from an energy-efficiency perspective, most of the memory is essentially on-chip, but that’s not feasible for everybody to do. Essentially, it’s the extreme case of doing as little damage as possible and keeping as much as you can on-chip. IBM’s research has shown that it works.

Woo: When you think about DRAM, you have to be very strategic about how you use it. You have to think a lot about the interplay between what’s above you in the memory hierarchy, which is SRAM, and what’s below you, which is the storage hierarchy. With any of those elements in the memory hierarchy, you don’t want to be moving a lot of data around if you can avoid it. When you do move it, you want to make sure you’re using that data as much as you possibly can to amortize the overhead. The industry has been very good at responding to some of these critical demands. If you look at the evolution of things like low-power DRAM and HBM, they were responses to the fact that standard memories weren’t meeting certain performance parameters, like power efficiency. Some of the paths forward people are talking about, especially with AI being a big driver, are ones that improve not only performance but also power efficiency — for example, trying to move toward taking DRAM and stacking it directly on processors, which will help both. Going forward, the industry will respond by looking at changes to architectures, not only incremental changes like the low-power roadmap, but larger ones as well.
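Woo’s amortization point can be made concrete with a simple energy model: the cost of fetching a byte from DRAM matters less the more operations are performed on it before it is discarded. The energy figures below are rough orders of magnitude treated as assumptions, not measurements of any particular device.

```python
# Sketch of amortizing data-movement energy over data reuse. Values assumed.

PJ_PER_BYTE_DRAM = 20.0   # assumed energy to move one byte in from off-chip DRAM
PJ_PER_OP        = 0.5    # assumed energy of one on-chip arithmetic operation

def energy_per_op(ops_per_byte):
    """Total energy per operation when each fetched byte is reused this many times."""
    return PJ_PER_OP + PJ_PER_BYTE_DRAM / ops_per_byte

for reuse in (1, 10, 100, 1000):
    print(f"{reuse:5d} ops/byte -> {energy_per_op(reuse):6.2f} pJ per op")
```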

SE: In addition to what we’ve been discussing, are there other ways memory can help solve latency issues?

White: We’re pushing compute out to the edge, and that will address a lot of the needs around edge computing. Also, the obvious benefit with CXL is that instead of passing data, we’re now passing pointers to memory addresses, which is more efficient and will reduce overall latency.
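The pointer-versus-data distinction can be illustrated, loosely, with ordinary host software: hand a consumer a small reference into shared memory rather than a copy of the payload. The snippet below uses Python’s shared_memory module purely as an analogy for that idea; it is not a CXL API.

```python
# Analogy only: pass a reference to shared data instead of copying the data.

from multiprocessing import shared_memory

# "Producer": place a result in shared memory once.
payload = b"large result computed once" * 1000
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Instead of sending the payload, send only a small reference (name + length).
reference = (shm.name, len(payload))

# "Consumer": attach by name and read in place; no bulk transfer over a message path.
view = shared_memory.SharedMemory(name=reference[0])
data = bytes(view.buf[:reference[1]])
assert data == payload

view.close()
shm.close()
shm.unlink()
```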

Schirrmeister: There’s a power issue there, as well. We have CXL, CHI, PCIe — all of these have to play together on-chip and chip-to-chip, especially in a chiplet environment. Imagine being in the back office, and your data is peacefully running across a chip with AXI or CHI, and now you want to go chiplet-to-chiplet. You suddenly have to start converting things, and from a power perspective that has an impact. Everybody’s talking about an open chiplet ecosystem and making exchanges between different players. For that to happen, you need to make sure you don’t have to convert all the time. It reminds me of the old days when you had something like five different video formats and three different audio formats, and all of them needed to be converted. You want to avoid that because of the power overhead and the added latency. From a NoC perspective, if I’m trying to get data out of the memory and I need to insert a block somewhere because I have to go through UCIe to another chip to reach the memory attached to that chip, it adds cycles. Because of this, the role of the architect is growing in importance. You want to avoid conversions, from both a latency and a low-power perspective. It’s just gates that don’t add anything. If only everybody would speak the same language.

Related Reading
Rethinking Memory (part 2 of above roundtable)
Von Neumann architecture is here to stay, but AI requires novel architectures and 3D structures create a need for new testing tools.
DRAM Choices Are Suddenly Much More Complicated
The number of options and tradeoffs is exploding as multiple flavors of DRAM are combined in a single design.
CXL: The Future Of Memory Interconnect?
Why this standard is gaining traction inside of data centers, and what issues still need to be solved.
SRAM In AI: The Future Of Memory
Why SRAM is viewed as a critical element in new and traditional compute architectures.



1 comment

Nicolas Dujarrier says:

What about the maturity of spintronics-related technologies, like non-volatile memory (NVM) (e.g., SOT-MRAM, VG-SOT-MRAM, or further out, VCMA-MRAM), or probabilistic computing accelerators for AI based on stochastic magnetic tunnel junctions (sMTJs) for p-bit computing, that could significantly improve energy efficiency and help trigger an innovative new memory hierarchy?

It seems that European semiconductor R&D center IMEC is working on some related concepts…
