Tools and optimizations are needed for SRAM to play a role in AI hardware; other memories make inroads.
Experts at the Table — Part 3: Semiconductor Engineering sat down to talk about AI, the latest issues in SRAM, and the potential impact of new types of memory, with Tony Chan Carusone, CTO at Alphawave Semi; Steve Roddy, chief marketing officer at Quadric; and Jongsin Yun, memory technologist at Siemens EDA. What follows are excerpts of that conversation. Part one of this conversation can be found here. Part two is here.
[L-R] Alphawave’s Chan Carusone; Quadric’s Roddy; Siemens’ Yun.
SE: What role will SRAM play in the world of emerging memories?
Chan Carusone: Emerging memories will co-exist in the memory hierarchy, with SRAM still providing fast access times close to the compute. The others generally provide higher storage density, but with longer latency. They’ll find their slot in the hierarchy, but they won’t displace the key role of SRAM.
Yun: SRAM is the first choice of memory because it is embedded in the chip for instant access. External memory, such as DRAM or flash, requires going through additional protocols, which costs multiple clock cycles and extra energy. Embedded memory, like MRAM or ReRAM, can be instantly accessible like SRAM, so one can save power and get the benefit of speed. Within the chip, we can have access to both SRAM and these embedded memory types. Comparing MRAM or ReRAM to SRAM, MRAM and ReRAM use a single transistor, whereas SRAM uses six, making them significantly smaller. Although they do require a larger transistor, the overall bitcell size is still about one-third that of SRAM. Considering all of the surrounding peripheral components, we are aiming for a rule of thumb where the complete macro is roughly half the size of the SRAM macro. There is a clear size advantage, but the write speed is still far slower than SRAM’s. There are a couple of promising write-speed achievements in labs, but we don’t know when we’ll have MRAM mature enough to replace L3 cache. Nevertheless, the size advantage and the benefit of on-chip memory are certainly worth considering.
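To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch of the area rules of thumb Yun cites (a 6T SRAM bitcell versus a 1T cell at roughly one-third the area, with the complete macro at roughly half the SRAM macro). The absolute cell areas and periphery overheads are hypothetical placeholders, not foundry data.

```python
# Minimal back-of-the-envelope sketch of the bitcell/macro area rules of
# thumb above. The ratios (6T vs. 1T cell, ~1/3 bitcell area, ~1/2 macro
# area) come from the discussion; the absolute areas and periphery
# overheads are hypothetical placeholders.

def macro_area_um2(bits: int, cell_area_um2: float, periphery_overhead: float) -> float:
    """Total macro area = bitcell array area plus peripheral overhead."""
    return bits * cell_area_um2 * (1.0 + periphery_overhead)

SRAM_CELL = 0.030              # hypothetical 6T SRAM bitcell, um^2
EMERGING_CELL = SRAM_CELL / 3  # 1T cell with a larger transistor: ~1/3 of SRAM

BITS = 8 * 1024 * 1024         # a 1MB macro

sram = macro_area_um2(BITS, SRAM_CELL, periphery_overhead=0.30)
# Emerging memories need more analog periphery (sense amps, references),
# so a larger overhead fraction is assumed; it is chosen here so the total
# lands at the ~1/2 rule of thumb quoted above.
mram = macro_area_um2(BITS, EMERGING_CELL, periphery_overhead=0.95)

print(f"SRAM macro:     {sram / 1e6:.2f} mm^2")
print(f"Emerging macro: {mram / 1e6:.2f} mm^2 ({mram / sram:.0%} of SRAM)")
```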
Roddy: This fits perfectly into the notion of how ahead-of-time graph compilation works today, in the sense that every layer of memory has a cost function from a performance standpoint — effectively, latency. If I’m scheduling an operation and I know it’s in registers or local SRAM, I know it’s one or two clock cycles away. If it’s going to be in something like DDR off-chip, there’s a cost function, since it’s going to take me 400 or 500 cycles of latency before I get back to that byte of data I’m accessing. If it’s in cold storage, like flash, it’s going to be thousands of cycles. And if something new comes along, for example eMRAM, that sits in between those layers, it just gets added into the compilation process and the tradeoff of where I store things versus how often I access them, and what the penalties are for reads and writes. That’s what the compiler teams do all day, whether it’s a GPU or CPU or an NPU. It’s all about scheduling ahead of time, fetching data, and sequencing things to keep the multiply-accumulate units fed properly. It’s the same with training versus inference. You’ve just laid out a bunch of MACs that you want to keep going.
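As a concrete illustration of that cost function, here is a minimal sketch of how an ahead-of-time scheduler might assign tensors to memory levels. The tier latencies, capacities, and the greedy policy are illustrative assumptions, not any particular vendor’s compiler.

```python
# Minimal sketch of a latency cost function in ahead-of-time scheduling:
# assign each tensor to a memory level, trading capacity against the cycle
# penalty of every access. All numbers are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class MemoryTier:
    name: str
    latency_cycles: int    # round-trip cost per access
    capacity_bytes: int
    used_bytes: int = 0

# Roughly the layers discussed: local SRAM, an eMRAM-like tier slotted in
# between, off-chip DDR, and flash "cold storage." Ordered fastest-first.
tiers = [
    MemoryTier("sram",     2,   4 * 2**20),
    MemoryTier("emram",   50,  64 * 2**20),   # hypothetical new layer
    MemoryTier("ddr",    450,   8 * 2**30),
    MemoryTier("flash", 5000, 256 * 2**30),
]

def place(tensors):
    """Greedy placement: hottest tensors get the fastest tier that fits."""
    placement, total_cost = {}, 0
    # Sort by access count so frequently touched data lands in SRAM first.
    for name, size, accesses in sorted(tensors, key=lambda t: -t[2]):
        for tier in tiers:
            if tier.used_bytes + size <= tier.capacity_bytes:
                tier.used_bytes += size
                placement[name] = tier.name
                total_cost += accesses * tier.latency_cycles
                break
    return placement, total_cost

# (name, size_bytes, accesses_per_inference)
tensors = [("weights_l1", 2 * 2**20, 1), ("act_l1", 1 * 2**20, 8),
           ("weights_l2", 48 * 2**20, 1)]
plan, cycles = place(tensors)
print(plan, cycles)  # the big weight tensor spills to the eMRAM-like tier
```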
SE: Are the tools keeping up? What needs to change in order to meet all of the trends you’ve been discussing?
Yun: All these new emerging technologies will need a lot of new developments and tool innovations. For an EDA vendor, it would be best if there were standards to help unify the industry. However, we don’t yet know what will become the standard, if anything at all. Therefore, we are working closely with our partners, collaborating on the right tool improvements and helping them with their implementations. For example, with MRAM there is a lot of analog circuitry necessary just to make these memories work, or work better. We developed an implementation-friendly, fully automated DFT solution to adjust those analog parameters. For such tool improvements, we must collaborate very closely with our partners. From this, we are working toward broader, more general solutions that would work for a large cross-section of end users, making the tools more universal.
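For illustration only, an automated trim flow of the kind described might look like the sketch below: sweep a trim code such as a sense-amp reference setting, find the passing window, and center the code in it. The BIST interface here is an invented placeholder, not a description of Siemens’ actual solution.

```python
# Minimal sketch of an automated analog trim loop a DFT flow might drive
# for an MRAM macro. run_bist() stands in for a hardware self-test; the
# pass window is faked so the example is self-contained.

def run_bist(trim_code: int) -> bool:
    """Placeholder for a read/write self-test at one trim setting."""
    return 5 <= trim_code <= 10   # pretend codes 5..10 pass on this die

def find_trim(codes=range(16)) -> int:
    passing = [c for c in codes if run_bist(c)]
    if not passing:
        raise RuntimeError("no passing trim window; reject the die")
    # Center the code in the passing window for maximum margin both ways.
    return passing[len(passing) // 2]

print("selected trim code:", find_trim())   # -> 8 with the fake model above
```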
Roddy: One area of tool improvement, which is just starting, is a bridge between the world of the data scientists and the computer architects, whether they’re building data center chips or edge chips. The data scientists think at a much higher level of abstraction. They’re doing mathematics. It’s all about inventing new operators, solving problems, and gaining better accuracy. They’re not thinking about the constraints of execution. However, there aren’t good feedback mechanisms today in the mathematical and training tools to give direction to the person building models, to help them realize that what they’re attempting will take two months of training time and cost the company $10 million with the meter running at Amazon. Or they’ve trained an interesting model, but it’s impossible to run on the platform they’re eventually targeting because it’s too big. They might get one inference per second, and that’s not good enough. How do you constrain what the data scientists do so it fits the silicon better? How do you design silicon for what’s coming in the future from the world of data science? There’s a lot of opportunity there in terms of next-generation integration of layers. The skill sets of these people are dramatically different. Think about how many people on LinkedIn have both a background in data science and a background in memory design — probably two in the world. The Venn diagrams are separate. Advances in integrating these skill sets will be key to making the path from invention to optimization to deployment a lot faster.
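A feedback mechanism of the kind Roddy describes could be as simple as an early cost estimator wired into the training tools. The sketch below uses invented constants, picked only so the output mirrors his examples (roughly two months and $10 million to train, and a model that manages only about one inference per second on its target).

```python
# Minimal sketch of the missing feedback loop: give the model builder an
# early estimate of training cost and on-target inference rate before any
# training dollars are spent. All constants are invented placeholders.

def training_estimate(train_flops, cluster_flops_per_s, utilization, dollars_per_hour):
    """Return (days, dollars) to train at a sustained fraction of peak."""
    seconds = train_flops / (cluster_flops_per_s * utilization)
    return seconds / 86400, seconds / 3600 * dollars_per_hour

def inference_rate(macs_per_inference, npu_macs_per_s, utilization):
    """Inferences per second the target NPU can sustain."""
    return npu_macs_per_s * utilization / macs_per_inference

days, cost = training_estimate(
    train_flops=1e23,          # total training compute for the model
    cluster_flops_per_s=5e16,  # rented cluster, peak
    utilization=0.4,           # realistic sustained fraction of peak
    dollars_per_hour=7000.0,   # cluster rental rate
)
ips = inference_rate(macs_per_inference=2e12,  # far too big for the edge NPU
                     npu_macs_per_s=4e12, utilization=0.5)

print(f"~{days:.0f} days, ~${cost:,.0f} to train; ~{ips:.1f} inference/s on target")
# A training tool could flag both numbers up front, e.g. reject the run if
# the inference rate misses the product's frame-rate requirement.
```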
SE: Does this need to be optimized on the hardware side or the algorithms side?
Chan Carusone: Both. We need co-optimization across the whole stack, all the way from the algorithms down to device technology. There’s a lot of interest in being able to properly design and validate multi-die systems, which encompasses things like design space exploration and architecture design, recognizing the end workload that’s going to run on that architecture. You’ve got the challenge of synthesizing the logic, memory, etc., across multiple dies and multiple technologies, and being able to verify it all. There’s interest from all sides in developing open ecosystems so that people can build this kind of co-optimization and design space exploration tool stack, and get to the point where you can take pre-validated chips from multiple companies and, with only small changes, mix and match them quickly to address emerging needs, creating efficient bespoke solutions without incurring extra costs and without using the most advanced technology when you don’t need to. It’s a challenge for tools, but there’s tons of emphasis on it. Even in conversations I have about the CHIPS Act, everyone wants to see this.
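In that spirit, a design-space exploration loop over pre-validated chiplets might look like the minimal sketch below. The chiplet menu, costs, and workload model are invented placeholders; a real flow would plug in validated models from multiple vendors.

```python
# Minimal design-space exploration sketch: enumerate multi-die
# configurations and score them against a target workload. All parts,
# costs, and the scoring model are invented placeholders.

from itertools import product

# Hypothetical chiplet menu: (name, tops, sram_mb, cost_$)
compute_dies = [("npu_small", 8, 4, 20), ("npu_big", 32, 16, 70)]
memory_dies  = [("none", 0, 0, 0), ("sram_stack", 0, 64, 40)]

def score(workload_tops, workload_mb, config):
    """Cost of a config, or None if it cannot meet the compute target."""
    (c_name, tops, sram, c_cost), (m_name, _, extra, m_cost) = config
    if tops < workload_tops:
        return None
    # Crude penalty for spilling the working set off-package.
    spill = max(0.0, workload_mb - (sram + extra))
    return c_cost + m_cost + 0.5 * spill

workload = dict(workload_tops=10, workload_mb=48)
best = min((cfg for cfg in product(compute_dies, memory_dies)
            if score(**workload, config=cfg) is not None),
           key=lambda cfg: score(**workload, config=cfg))
print("best config:", best[0][0], "+", best[1][0])
```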
Yun: A recent field evaluation of server machines revealed significant silent data corruption issues. When it comes to diagnosing the problem, physical failure analysis of an individual chip takes a relatively short time, just a few minutes to a few hours. However, when that has to be combined with software evaluation of the entire system, the diagnosis can take six months or more to identify some fault types. The software update may be relatively quick, but pinpointing a problem that occurs within the system due to a combination of a weak circuit transient and software behavior is extremely challenging. Therefore, it is crucial for the software team, DFT team, and manufacturing team to collaborate, jointly determine the root cause of the problem, and then find a way to screen for it in the manufacturing flow.
SE: Is it realistic to think we can actually replace current memory types?
Yun: There’s a lot of discussion about whether we can replace a current memory type with a new one. A complete replacement of current memory types is difficult to achieve in the near future. The current hierarchical memory structure will remain in most architectures because it is the best-known approach. Alternate ways of configuring the hierarchy for specific chip designs and specific applications will keep appearing and being evaluated for their benefits, but the new configurations will first fill the gaps in the current memory hierarchy between cache and main memory, and between main memory and storage. Although some labs have achieved promising performance, the alternative memories are not yet at a level where full replacement of SRAM can be considered, so replacing SRAM with an alternative memory won’t happen within a few years. What we have is the current best solution, and it will keep going for a while.
Chan Carusone: The hardware solutions may look different, but you will still be able to map them onto some kind of memory hierarchy. We’re working hard to come up with new connectivity solutions with lower power and lower latency. That may change the tradeoffs, so the solution may look different, but you’ll still be able to describe it in terms of some hierarchy. SRAM will play an important role somewhere in that hierarchy.
Roddy: The mix will change, but SRAM is going to be a key first-level memory going forward for decades, probably as it has been for the last four or five decades.
SE: What else can be expected for the future of SRAM?
Roddy: It’s a very dynamic industry. It’s hard to sit here and even predict what’s going to happen in 2024. The size of the networks people tried to train two or three years ago is radically different from what it is today. The good news, which is like the Full Employment Act for engineers, is that it’s changing rapidly. In three years, we’ll probably have another roundtable like this, and we’ll be amazed at the progress.
Chan Carusone: SRAM’s critical role in this AI gold rush will ensure that’s the case.