Memory’s Future Hinges On Reliability

Robust implementations are a major issue, particularly as memory density increases.


Experts at the Table: Semiconductor Engineering sat down to talk about the impact of power and heat on off-chip memory, and what can be done to optimize performance, with Frank Ferro, group director, product management at Cadence; Steven Woo, fellow and distinguished inventor at Rambus; Jongsin Yun, memory technologist at Siemens EDA; Randy White, memory solutions program manager at Keysight; and Frank Schirrmeister, vice president of solutions and business development at Arteris. What follows are excerpts of that conversation. Part one of this discussion is here. Part two is here.

[L-R]: Frank Ferro, Cadence; Steven Woo, Rambus; Jongsin Yun, Siemens EDA; Randy White, Keysight; and Frank Schirrmeister, Arteris.

SE: What issues is the chip industry facing with future memory implementations?

Woo: As devices get smaller, and try to go faster, reliability becomes much more of a challenge. The ability to see what’s really going on in the silicon is important. We see a lot of evidence now in memory that it’s just becoming harder to increase the density, to shrink bit cells, and to go faster. One of the things we’re starting to deal with now is on-die ECC. And as the bit cells get smaller, it’s hard to manufacture them all with perfect reliability. We’re seeing all the memories adopting some form of on-die ECC to do correction.
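
As a rough illustration of the correction behavior Woo describes, here is a minimal single-error-correct, double-error-detect (SECDED) sketch in Python using an extended Hamming(8,4) code. The code width, layout, and function names are illustrative assumptions only; real on-die ECC in DRAM uses much wider, vendor-specific codewords.

```python
# Minimal SECDED sketch: extended Hamming(8,4) over a 4-bit value.
# Illustrative only -- real on-die ECC uses much wider codewords and
# vendor-specific codes; this just shows correct-one / detect-two behavior.

def encode(nibble):
    """Return 8 code bits [p0, p1, p2, d0, p3, d1, d2, d3] for a 4-bit value."""
    d = [(nibble >> i) & 1 for i in range(4)]      # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    word = [p1, p2, d[0], p3, d[1], d[2], d[3]]    # Hamming positions 1..7
    p0 = 0
    for b in word:
        p0 ^= b                                    # overall parity bit
    return [p0] + word

def decode(code):
    """Return (nibble, status): status is 'ok', 'corrected', or 'double_error'."""
    p0, word = code[0], code[1:]
    # XOR of the 1-based positions of set bits points at a single-bit error.
    syndrome = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= pos
    overall = p0
    for b in word:
        overall ^= b                               # 0 if overall parity still holds
    if syndrome == 0 and overall == 0:
        status = 'ok'
    elif overall == 1:
        if syndrome:
            word[syndrome - 1] ^= 1                # single-bit flip: correct it
        status = 'corrected'
    else:
        status = 'double_error'                    # detectable, not correctable
    nibble = word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)
    return nibble, status

# Example: flip one bit in flight and watch it get corrected.
cw = encode(0b1011)
cw[5] ^= 1
print(decode(cw))   # (11, 'corrected')
```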

At higher data rates the systems folks are looking at system-level ECC, especially in main memory, where some system-level type of ECC makes up for errors that can happen on the links, or in more catastrophic situations where lots of bits start to fail from one device. There are things going on between neighboring bit cells that didn't happen before, neighbor-disturb issues like row hammer and RowPress. We're starting to see support inside the DRAM for dealing with those kinds of issues. On top of all these challenges, the systems people are asking us for more bandwidth, lower power, better latency, and better performance. There are other things going on that we're having to deal with, too. They're interesting problems, but they're all solvable. We'll have to continue to innovate in ways that allow us to get better observability and improve our reliability, as well as just getting to better performance levels.
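
The link-protection part of that can be sketched as a simple detect-and-retry scheme: attach a check code to each data burst, and if the receiver's recomputed value doesn't match, the burst is resent. This is only a conceptual sketch under assumed behavior; the check-code widths and retry protocols on real memory interfaces are defined by JEDEC, and zlib.crc32 here is just a stand-in for illustration.

```python
# Sketch of link-level error detection with retry (assumed, simplified scheme).
import random
import zlib

def send_burst(payload: bytes) -> tuple[bytes, int]:
    """Transmitter side: attach a CRC to the data burst."""
    return payload, zlib.crc32(payload)

def noisy_link(payload: bytes, flip_prob: float = 0.3) -> bytes:
    """Occasionally flip one bit to model a transient link error."""
    if random.random() < flip_prob:
        data = bytearray(payload)
        bit = random.randrange(len(data) * 8)
        data[bit // 8] ^= 1 << (bit % 8)
        return bytes(data)
    return payload

def receive_burst(payload: bytes, crc: int) -> bool:
    """Receiver side: recompute the CRC; a mismatch means 'please resend'."""
    return zlib.crc32(payload) == crc

burst = bytes(range(64))                 # one 64-byte cache-line-sized burst
for attempt in range(1, 6):
    sent, crc = send_burst(burst)
    received = noisy_link(sent)
    if receive_burst(received, crc):
        print(f"burst accepted on attempt {attempt}")
        break
    print(f"CRC mismatch on attempt {attempt}, retrying")
```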

Yun: One of the big problems we have in the data center is silent data corruption. We test as much as we can to detect errors and fix that problem. However, we still have some hidden fails underneath, and they propagate to the next level and then cause a big problem. This is not just related to the hardware. It also is related to software, and some of those software-related defects or faults require six months of testing. It is very problematic to detect all of those, so more people are using ECC in the data center to avoid those types of issues. There's also a lot of research to improve test quality. When you deal with test quality, that means you perform more and more tests, which will increase the test budget. That is another big issue we need to think about: the price of the entire system. There are a lot of challenges going on in that area.

Ferro: ECC is a very important component for customers, regardless of which memory they're talking about. We've got to support more reliability features. Fortunately, we have a good view of the market, because we see a lot of different customers and a lot of different architectures. Customer interest in DDR5 is still very strong. HBM is very strong in the data center, but GDDR and LPDDR are also architectures of interest. Right now, for AI training, HBM is certainly dominant, along with DDR5. But as training matures, you're going to see more of these models getting pushed out for inference. Then performance comes down a little bit and cost becomes more of a concern, so we see LPDDR and GDDR as options. We're seeing all of the above. There are lots of different architectures.

Fig. 1: DDR/LPDDR PHY and Controller System. Source: Cadence.

White: On reliability, an adjacent issue is security. In JEDEC there's a data integrity workgroup that's been formed to address row hammer. It gets a lot of buzz, and there are arguments about whether it's a real issue or more of an academic exercise that universities like to dig into. Whether it is an issue or not, I am happy to say it has been addressed, and it's the most elegant solution ever: implement a counter in the memory so we know how many times a row has been addressed, and then adapt or modify the refresh management. An issue that would have serious implications if exploited seems to have been resolved, as long as software and BIOS engineers implement this particular feature in their code. We should look out for security as part of the reliability theme. It's definitely ongoing work. There will always be threats to memory as a target.
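
A rough sketch of the counter-based idea White describes: count activations per row, and when a row is hammered past a threshold before the next refresh window, schedule extra refreshes for its physically adjacent victim rows. The threshold, window handling, and class names below are made-up illustration values, not the actual JEDEC mechanism.

```python
# Rough model of counter-based row hammer mitigation, as described above:
# track per-row activation counts and refresh the neighbors of any row that
# is activated too often within a refresh window. Values are illustrative.
from collections import defaultdict

ACT_THRESHOLD = 4            # hypothetical activations allowed per window
NUM_ROWS = 16

class RowHammerMitigator:
    def __init__(self):
        self.act_counts = defaultdict(int)

    def on_activate(self, row: int) -> list[int]:
        """Called on every row activation; returns victim rows to refresh."""
        self.act_counts[row] += 1
        if self.act_counts[row] >= ACT_THRESHOLD:
            self.act_counts[row] = 0
            # Physically adjacent rows are the ones disturbed by the aggressor.
            return [r for r in (row - 1, row + 1) if 0 <= r < NUM_ROWS]
        return []

    def on_refresh_window(self):
        """Normal periodic refresh clears the accumulated counts."""
        self.act_counts.clear()

# Example: hammering row 7 triggers targeted refreshes of rows 6 and 8.
mit = RowHammerMitigator()
for _ in range(10):
    victims = mit.on_activate(7)
    if victims:
        print("targeted refresh of rows", victims)
```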

I also wanted to return to the question of whether there is a shift in the von Neumann architecture. I was just thinking about how AI/ML have evolved the compute conversation to talk about locality. I also wonder about quantum computing versus classical computing, and what the implications are there in terms of the overall architecture. I don't think there's a significant memory need or bandwidth requirement, but it's another dimension we should probably think about. How will quantum and classical computing co-exist in the next 10 to 20 years?


Fig. 2: Row hammer attack. Source: Rambus.

Schirrmeister: I don’t have an answer to that, but to close out on the reliability question, ECC, reliability, safety, and broader security are all important from a NoC perspective, and we are only one component. For a system designer or an SoC designer, the weakest link is the problem. All of these need to be addressed across the memory, the NoC, and the other components.

When I joined Arteris and learned more about the effects of data transport and memory storage, and the impact they have on architecture, I wondered why we haven’t figured out the memory wall yet. We have worked around it with architecture enhancements on how to use memory and how to parallelize it, and then by doing things like in-memory compute. There’s a lot of interesting work going on, but memory certainly needs more breakthroughs, because at some point we’ll run out of architecture enhancements that work. So we need to work on the basics.

Part one of this discussion is here: The Future Of Memory.
Part two is here: Rethinking Memory.


