Rethinking Memory

Von Neumann architecture is here to stay, but novel AI architectures and 3D structures create a need for new testing tools.


Experts at the Table: Semiconductor Engineering sat down to talk about the path forward for memory in increasingly heterogeneous systems, with Frank Ferro, group director, product management at Cadence; Steven Woo, fellow and distinguished inventor at Rambus; Jongsin Yun, memory technologist at Siemens EDA; Randy White, memory solutions program manager at Keysight; and Frank Schirrmeister, vice president of solutions and business development at Arteris. What follows are excerpts of that conversation. Part one of this discussion can be found here.

[L-R]: Frank Ferro, Cadence; Steven Woo, Rambus; Jongsin Yun, Siemens EDA; Randy White, Keysight; and Frank Schirrmeister, Arteris.

SE: As we struggle with AI/ML and power demands, what configurations need to be rethought? Will we see a shift away from Von Neumann architecture?

Woo: In terms of system architectures, there's a bifurcation going on in the industry. The traditional applications that are the dominant workhorses, which we run in the cloud on x86-based servers, are not going away. There are decades of software that have been built up and evolved, and which will rely on that architecture to perform well. By contrast, AI/ML is a new class. People have rethought the architectures and built very domain-specific processors. We're seeing that about two-thirds of the energy is spent just moving the data between a processor and an HBM device, while only about a third is spent actually accessing the bits in the DRAM cores. The data movement is now much more challenging and expensive. We're not going to get rid of memory. We need it because the datasets are getting larger. So the question is, 'What's the right way going forward?' There's been a lot of discussion about stacking. If we were to take that memory and put it directly on top of the processor, it does two things for you. First, bandwidth today is limited by the shoreline, or perimeter, of the chip. That's where the I/Os go. But if you were to stack the memory directly on top of the processor, you can make use of the whole area of the chip for distributed interconnects, get more of the bandwidth out of the memory itself, and feed it directly down into the processor. Links get a lot shorter, and the power efficiency probably goes up on the order of 5X to 6X. Second, because of that area-array interconnect to the memory, the amount of bandwidth you can get goes up severalfold, as well. Doing those two things together can provide more bandwidth and make it more power-efficient. The industry evolves to whatever the needs are, and that's definitely one way we'll see memory systems start to evolve in the future to become more power-efficient and provide more bandwidth.
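As a rough illustration of the numbers Woo cites, the back-of-envelope sketch below applies the two-thirds/one-third energy split and an assumed 5X improvement in link efficiency from stacking. The absolute energy-per-bit figure is an assumption chosen for illustration, not a number from the discussion.

```python
# Back-of-envelope sketch of the energy split described above. Only the
# 2/3 vs. 1/3 split and the ~5X link-efficiency gain come from the
# discussion; the absolute pJ/bit figure is an assumed, illustrative value.

PJ_PER_BIT_TOTAL = 6.0                        # assumed total energy per bit
data_movement = PJ_PER_BIT_TOTAL * (2 / 3)    # processor <-> HBM transport
core_access = PJ_PER_BIT_TOTAL * (1 / 3)      # DRAM core bit access

# Stacking the memory on the processor shortens the links. Assume a 5X
# improvement in transport energy; core access energy is unchanged.
stacked_movement = data_movement / 5
stacked_total = stacked_movement + core_access

print(f"planar:  {PJ_PER_BIT_TOTAL:.1f} pJ/bit")
print(f"stacked: {stacked_total:.1f} pJ/bit "
      f"({PJ_PER_BIT_TOTAL / stacked_total:.1f}x better overall)")
```

Note that even a 5X gain on the transport portion improves the total by only about 2X, because the core-access energy does not shrink with shorter links.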

Ferro: When I first started working on HBM back around 2016, some of the more advanced customers asked if it could be stacked. They've been looking at how to stack the DRAM on top for quite some time because there are clear advantages. From the physical layer, the PHY becomes basically negligible, which saves a lot of power and improves efficiency. But now you've got a several-hundred-watt processor with a memory on top of it. The memory can't take the heat. It's probably the weakest link in the thermal chain, which creates another challenge. There are benefits, but they still have to figure out how to deal with the thermals. There's more incentive now to move that type of architecture forward, because it really does save you overall in terms of performance and power, and it will improve your compute efficiency. But there are some physical design challenges that have to be dealt with. As Steve was saying, we see all kinds of architectures coming out. I totally agree that the GPU/CPU architectures aren't going anywhere. They're still going to be dominant. At the same time, every company on the planet is trying to come up with a better mousetrap for its AI. We see on-chip SRAM and combinations with high-bandwidth memory. LPDDR has been raising its head quite a bit these days in terms of how to take advantage of it in the data center because of the power. We've even seen GDDR being used in some AI inference applications, as well as all the older memory systems. Companies are now trying to squeeze as many DDR5 devices into a footprint as possible. I've seen every architecture you can think of, whether it's DDR, HBM, GDDR, or others. It depends on your processor core, in terms of what your overall value-add is, and on how you can break through with your particular architecture and the memory system that goes with it, so you can sculpt your CPU and memory architecture depending on what's available.

Yun: Another issue is non-volatility. If an IoT-based AI, for example, has to deal with power cycling between runs, then the chip powers off and on repeatedly, and all the information for the AI has to be reloaded again and again. If we have some type of solution where we can store those weights on the chip, so we don't always have to move the same weights back and forth, it will save a lot of power, especially for IoT-based AI. That would be another solution to help with those power demands.

Schirrmeister: What I find fascinating, from a NoC perspective, is where you have to optimize these paths from a processor going through a NoC, accessing a memory interface with a controller, and potentially going through UCIe to another chiplet, which then has memory in it. It's not that Von Neumann architectures are dead, but there are so many variations now, depending on the workload you want to compute. They need to be considered in the context of memory, and memory is only one aspect. Where do you get the data? What's the data locality? How is it arranged in the DRAM? We are working through all these things, such as performance analysis of memories, and then optimizing the system architecture on top of that. It is spurring a lot of innovation in new architectures, which I never thought of when I was at university learning about Von Neumann. At the other extreme, you have things like meshes. There's a whole lot more architectures in between to be considered now, and it's driven by memory bandwidth and compute capabilities not growing at the same rate.

White: There's a trend involving disaggregated or distributed computing, which means the architect needs to have more tools at their disposal. The memory hierarchy has expanded. There are memory semantics, as well as CXL and different hybrid memories, available for flash and DRAM. A parallel application to the data center is automotive. Automotive always had sensor compute with ECUs (electronic control units). I'm fascinated by how it has evolved toward the data center. Fast forward, and today we have distributed compute nodes, called domain controllers. It's the same thing. Maybe power is not as big of a deal because the scale of compute is not as large, but latency is certainly a big deal in automotive. ADAS needs super-high bandwidth, so you've got different tradeoffs. And then you've got more mechanical sensors, but similar constraints to a data center. You've got cold storage that doesn't need to be low-latency, and then you've got other high-bandwidth applications. It's fascinating to see how much the tools and the options for the architect have evolved. The industry has done a really good job of responding, and all of us provide various solutions that feed into the market.

SE: How have memory design tools evolved?

Schirrmeister: When I started with my first couple of chips in the '90s, the most-used system tool was Excel. Since then, I've always hoped it might break at some point for the things we do at the system level, such as memory and bandwidth analysis. This impacted my teams quite a bit. At the time, it was very advanced stuff. But to Randy's point, certain complex things now need to be simulated at a level of fidelity that was previously not possible without the compute. To give an example, assuming a certain latency for a DRAM access can lead to bad architecture decisions, and potentially to incorrectly designing the on-chip data transport architectures. The flip side is also true. If you always assume the worst case, then you will over-design the architecture. Having tools perform the DRAM and performance analysis, and having the proper models available for the controllers, allows an architect to simulate all of it. That's a fascinating environment to be in. My hope from the '90s that Excel might break at some point as a system-level tool may actually come true, because certain dynamic effects can't be done in Excel anymore. You need to simulate them, especially when you throw in a die-to-die interface with PHY characteristics, and then link-layer characteristics, like all the checking of whether everything arrived correctly and the potential resending of data. Not having those simulations done will result in a sub-optimal architecture.
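The point about fixed-latency assumptions can be illustrated with a minimal sketch. The row-hit and row-miss latencies and the hit rate below are assumed, illustrative numbers; the point is simply that a simulated mean lands between the best-case and worst-case spreadsheet models.

```python
import random

# Minimal sketch (assumed numbers): a fixed-latency spreadsheet model vs. a
# simple Monte Carlo run where some accesses hit an open row (fast) and the
# rest miss (slow). The 15/45 ns values and 70% hit rate are assumptions.
random.seed(0)
HIT_NS, MISS_NS = 15, 45
latencies = [HIT_NS if random.random() < 0.7 else MISS_NS
             for _ in range(10_000)]
avg = sum(latencies) / len(latencies)

print(f"best-case model : {HIT_NS} ns")
print(f"worst-case model: {MISS_NS} ns")
print(f"simulated mean  : {avg:.1f} ns")  # falls between the two extremes
```

Designing to the best-case number under-provisions the transport; designing to the worst case over-provisions it, which is exactly the over-design trap described above.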

Ferro: The first step in most evaluations we do is to give customers a memory testbench so they can start looking at DRAM efficiency. That's a huge step, even doing things as simple as running local tools for DRAM simulation, before going into full-blown simulations. We see more customers asking for that type of simulation. Making sure your DRAM efficiency is up in the high 90s is a very important first step in any evaluation.
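To show why efficiency in the high 90s matters, here is a small, hedged sketch. DRAM efficiency is the fraction of bus cycles that carry useful data (the rest go to refresh, activates/precharges, turnarounds, and so on). The peak-bandwidth figure is an assumption for illustration.

```python
# Illustrative sketch: effective bandwidth = peak bandwidth x efficiency.
# The 819.2 GB/s peak (roughly one HBM3 stack) is an assumed figure.

def effective_bandwidth(peak_gbps: float, efficiency: float) -> float:
    """Usable bandwidth given peak pin bandwidth and achieved efficiency."""
    return peak_gbps * efficiency

peak = 819.2  # GB/s
for eff in (0.70, 0.85, 0.97):
    print(f"{eff:.0%} efficient -> {effective_bandwidth(peak, eff):7.1f} GB/s")
```

The gap between 70% and 97% efficiency is over 200 GB/s of delivered bandwidth on the same physical interface, which is why the testbench step comes first.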

Woo: Part of why you see the rise of full-system simulation tools is that DRAMs have become much more complicated. It's very difficult now to even be in the ballpark for some of these complex workloads using simple tools like Excel. If you look at a DRAM datasheet from the '90s, it was about 40 pages. Now they're hundreds of pages, which speaks to the complexity of the device needed to get the high bandwidths out. Couple that with the fact that memory is such a driver of system cost, as well as of bandwidth and latency as they relate to processor performance. It's also a big driver of power. So you do need to simulate at a much more detailed level now. In terms of tool flow, system architects understand that memory is a huge driver, so the tools need to be more sophisticated, and they need to interface with other tools very well so the system architect gets the best global view of what's going on, especially of how memory is impacting the system.

Yun: As we move into the AI era, a lot of multi-core systems are used, but we don't know which data goes where. Data is also moving to the chip in a more parallel way, and the size of the memory is a lot bigger. A ChatGPT-type of AI requires handling about 350MB of data, which is a huge amount just for the weights, and the actual inputs and outputs are much larger. That increase in the amount of data required means there are a lot of probabilistic effects we haven't seen before. It's extremely challenging to test for all the errors related to this large amount of memory. And ECC is used everywhere, even in SRAM, which traditionally didn't use ECC but now commonly does in the largest systems. Testing for all of that is very challenging and needs to be supported by EDA solutions that can test all those different conditions.

SE: What challenges do engineering teams face on a day-to-day basis?

White: On any given day, you will find me in the lab. I roll up my sleeves and get my hands dirty, poking wires, soldering, and whatnot. I think a lot about post-silicon validation. We talked about early simulation and on-die tools, BiST, and things like that. At the end of the day, before we ship, we want to do some form of system validation or device-level test. We talked about how to overcome the memory wall by co-locating memory, HBM, and things like that. If we look at the evolution of packaging technology, we started out with leaded packages, which weren't very good for signal integrity. Decades later, we moved to packages optimized for signal integrity, like ball grid arrays (BGAs). But we couldn't access the signals, which meant we couldn't test them. So we came up with the concept of a device interposer, a BGA interposer, which allowed us to sandwich in a special fixture that routed signals out. Then we could connect it to the test equipment. Fast forward to today, and now we have HBM and chiplets. How do I sandwich my fixture in between on the silicon interposer? We can't, and that's the struggle. It's a challenge that keeps me up at night. How do we perform failure analysis in the field with an OEM or system customer when they're not getting the 90% efficiency? There are more errors in the link, it can't initialize properly, and training is not working. Is it a signal integrity problem?

Schirrmeister: Wouldn't you rather do this from home with a virtual interface than walk to the lab? Isn't the answer more analytics built into the chip? With chiplets, we integrate everything even further. Getting your soldering iron in there is not really an option, so there needs to be a way to do on-chip analytics. We have the same problem with the NoC. People look at the NoC, and you send the data and then it's gone. We need to put analytics in there so people can debug, and that extends to the manufacturing level, so you finally can work from home and do it all based on chip analytics.

Ferro: Especially with high-bandwidth memory, you can't physically get in there. When we license the PHY, we also have a product that goes with it so you can put eyes on every single one of those 1,024 bits. You can start reading and writing the DRAM from the tool, so you don't have to physically get in there. I like the interposer idea. We do bring some pins out of the interposer during testing, which you can't do in-system. It's really a challenge to get into these 3D systems. Even from a design tool flow standpoint, it seems like most companies do their own individual flows with a lot of these 2.5D tools. We're starting to put together a more standardized way to build a 2.5D system, from signal integrity to power, covering the entire flow.

White: As things move on-die, I hope we can still maintain the same level of accuracy. I'm in the UCIe form factor compliance group, looking at how to characterize a known-good die, a golden die. This is going to take a lot more time, but eventually we're going to find a happy medium between the performance and accuracy of the testing we need, and the flexibility that's built in.

Schirrmeister: If I look at chiplets and their adoption in a more open production environment, testing is one of the bigger challenges standing in the way of making it work right. If I'm a big company and I control all sides of it, then I can constrain things appropriately so that testing and so forth becomes feasible. But if I take the UCIe slogan that UCIe is only one letter away from PCIe, and imagine a future where UCIe assembly becomes, from a manufacturing perspective, like PCIe slots in a PC today, then the testing aspects are really challenging. We need to find a solution. There's lots of work to do.

Related Articles
The Future Of Memory (Part 1 of the above roundtable)
From attempts to resolve thermal and power issues to the roles of CXL and UCIe, the future holds a number of opportunities for memory.
