While it will remain a workhorse memory, using SRAM at advanced nodes requires new approaches.
The inability of SRAM to scale has challenged power and performance goals, forcing the design ecosystem to come up with strategies that range from hardware innovations to re-thinking design layouts. At the same time, despite the age of its initial design and its current scaling limitations, SRAM has become the workhorse memory for AI.
SRAM, and its slightly younger cousin DRAM, have always come with tradeoffs. SRAM is most commonly configured with six transistors, which gives it faster access times than DRAM, but at the price of using more energy for reads and writes. By contrast, DRAM uses a one transistor/one capacitor design, which makes it cheaper. But DRAM compromises performance because capacitors need to be refreshed due to charge leakage — and sometimes will do so spontaneously when the memory heats up. Thus, over 60 years since its introduction, SRAM has remained the memory of choice in applications where lower latency and reliability are prioritized.
Fig. 1: SRAM cell sizes shrink more slowly than processes, causing F² to balloon. Source: Objective Analysis/Emerging Memory Technologies report
Indeed, for AI/ML applications, SRAM is more than holding its own. “SRAM is essential for AI — and embedded SRAM, in particular. It’s the highest-performance memory, and you can integrate it directly alongside the high-density logic. For those reasons alone, it’s important,” said Tony Chan Carusone, CTO of Alphawave Semi.
Power and performance challenges
But when it comes to keeping pace with CMOS scaling, SRAM has fallen flat, with consequences for power and performance. “In traditional scaling of planar devices, gate length and gate oxide thickness were scaled down together to improve performance and control of the short-channel effect. A thinner oxide enabled the performance gain in lower VDD level, which is advantageous for SRAM in reducing both leakage and dynamic power,” said Jongsin Yun, memory technologist at Siemens EDA. “However, in recent technology node migrations, we’ve barely seen further scaling oxide or VDD levels. Moreover, the geometric shrinkage of transistors results in thinner metal interconnects, leading to increased parasitic resistance and, consequently, more power loss, and RC delay. As AI design increasingly demands more internal memory access, it has become a significant challenge for SRAM to further scale its power and performance benefits in technology node migration.”
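To make the interconnect point concrete, the sketch below applies the textbook relation R = ρL/(w·t) for a rectangular wire. The function name, the wire dimensions, and the 0.7x shrink factor are illustrative assumptions, not figures from Siemens EDA. The bulk copper resistivity used here also understates what real nanoscale wires see, because scattering raises effective resistivity as wires narrow.

```python
# Illustrative only: wire dimensions and the shrink factor are assumed values.
RHO_CU_BULK = 1.68e-8  # ohm-metres, bulk copper at room temperature


def wire_resistance(length_m, width_m, thickness_m, rho=RHO_CU_BULK):
    """Resistance of a rectangular wire: R = rho * L / (w * t)."""
    return rho * length_m / (width_m * thickness_m)


# Hypothetical wire: 10 um long, 40 nm wide, 80 nm thick.
before = wire_resistance(10e-6, 40e-9, 80e-9)

# Shrink width and thickness by 0.7x while keeping the length fixed.
after = wire_resistance(10e-6, 0.7 * 40e-9, 0.7 * 80e-9)

print(f"resistance grows by {after / before:.2f}x")
```

Under these assumptions, shrinking both cross-section dimensions by 0.7x roughly doubles the resistance of a wire of the same length, which is the mechanism behind the added power loss and RC delay.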
These issues, along with SRAM’s high cost, inevitably lead to performance compromises. Instead of relying entirely on SRAM, there is a whole hierarchy of memory/storage options, starting with off-chip DRAM, which is available in various speeds and architectural configurations.
“If you can’t get enough SRAM to meet the data storage needs of a processor core, then the core will end up having to move data from further away,” said Steve Woo, fellow and distinguished inventor at Rambus. “Additional power is required to move data between SRAM and DRAM, so the system consumes more power. And it takes longer to access that data from DRAM, so performance will go down.”
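Woo's point can be sketched with a simple weighted-average model: every access that misses in SRAM pays the DRAM cost instead. The latency and energy constants below are placeholders chosen only to show the trend, not measurements of any real device, and `average_access` and `summarize` are hypothetical helper names.

```python
# Placeholder costs for illustration only (not measured values).
SRAM_LATENCY_NS = 1.0
DRAM_LATENCY_NS = 50.0
SRAM_ENERGY_PJ = 1.0
DRAM_ENERGY_PJ = 100.0


def average_access(hit_rate, sram_cost, dram_cost):
    """Average cost per access when SRAM misses are served from DRAM."""
    return hit_rate * sram_cost + (1.0 - hit_rate) * dram_cost


def summarize(hit_rate):
    """Return (average latency in ns, average energy in pJ) for a hit rate."""
    latency = average_access(hit_rate, SRAM_LATENCY_NS, DRAM_LATENCY_NS)
    energy = average_access(hit_rate, SRAM_ENERGY_PJ, DRAM_ENERGY_PJ)
    return latency, energy


for rate in (0.95, 0.80):
    latency, energy = summarize(rate)
    print(f"hit rate {rate:.2f}: {latency:.2f} ns, {energy:.2f} pJ per access")
```

With less SRAM the hit rate falls, and both averages rise together. The model ignores overlap, prefetching, and intermediate cache levels, so it is only a first-order picture.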
The situation likely will not improve, and may even get worse, at each new node.
“Looking ahead toward nanosheets, very little dimensional scaling is expected for SRAM,” said Geert Hellings, program director DTCO at imec. “One could say that replacing fins (5nm wide) by nanosheets (~15nm wide) would increase the SRAM bit cell height by 40nm (4 fins per), if all other process/layout margins remain constant. Obviously, this would not be a very good value proposition. Therefore, ‘flanking’ improvements in process/layout margins are expected to offset this. Nevertheless, it’s a bit of an uphill battle scaling SRAM from finFETs to nanosheets.”
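The arithmetic behind Hellings' 40nm figure is short enough to write out. The sketch below takes the widths and the count of four from the quote; the function and parameter names are hypothetical, and the model assumes, as the quote does, that every other margin stays constant.

```python
def added_cell_height_nm(fin_width_nm, sheet_width_nm, devices_across_cell):
    """Extra bit cell height if each fin is swapped for a wider nanosheet."""
    return (sheet_width_nm - fin_width_nm) * devices_across_cell


# Values from the quote: 5 nm fins, ~15 nm nanosheets, 4 per cell.
print(added_cell_height_nm(5, 15, 4))  # 40
```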
Flex Logix, which has worked at several of the lowest nodes, including TSMC’s N7 and N5, and recently received the PDK for Intel’s 18A node, is very familiar with the challenges. “Our customers who work on advanced nodes are all complaining that the logic is scaling better and faster than the SRAM,” said Flex Logix CEO Geoffrey Tate. “That’s an issue for processors, because it’s unusual to have cache memories that are bigger than the whole processor. But if you were to put it off-chip, your performance would drop like a rock.”
TSMC is hiring more memory designers to improve SRAM density, but whether they can eke more out of SRAM remains to be seen. “Sometimes you can make things better by applying more people, but only up to a point,” said Tate. “Over time, customers will need to think about architectures that don’t use SRAM as intensively as they do now.”
In fact, as far back as 20nm, SRAM’s inability to scale commensurately with logic portended power and performance challenges, since on-chip memory could end up occupying more area than the logic on the chip. In response to such issues, both system designers and hardware developers are applying new solutions and developing new techniques.
Along those lines, AMD has taken a different approach. “They’ve introduced a technology called 3D V-Cache, which allows additional SRAM cache memory on a separate chip to be stacked on top of processors, increasing the amount of cache available to the processor cores,” said Rambus’ Woo. “The additional chip adds cost, but allows access to additional SRAM. Another strategy is to have multiple levels of cache. Processor cores can have private (unshared) level 1 and level 2 caches that only they can access, and a much larger last-level cache (LLC) that is shared between the processor cores. With processors having so many cores, sharing the LLC allows some cores to use more capacity at times, and some to use less, so that the total capacity gets used more efficiently across all processor cores.”
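The capacity argument for a shared LLC can be illustrated with a toy model. It compares giving each core a fixed, equal slice against letting all cores draw from one pool. The per-core demand figures and the 16MB total are invented for illustration, and this is a capacity count only, not a cache simulator: it ignores associativity, replacement policy, and contention.

```python
def served_private(demands_mb, total_mb):
    """Equal fixed slice per core; space a core does not need goes unused."""
    slice_mb = total_mb / len(demands_mb)
    return sum(min(demand, slice_mb) for demand in demands_mb)


def served_shared(demands_mb, total_mb):
    """One pool for all cores, limited only by the total capacity."""
    return min(sum(demands_mb), total_mb)


# Hypothetical working-set sizes: one busy core, three light ones.
demands = [12, 2, 1, 1]
total = 16

print("private slices serve:", served_private(demands, total), "MB")
print("shared pool serves:  ", served_shared(demands, total), "MB")
```

In this made-up case the fixed slices satisfy 8MB of demand while the shared pool satisfies all 16MB, because the busy core can use capacity the light cores leave idle.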
Error correction
Scaling adds reliability issues, as well. “SRAM traditionally has used more aggressive, smaller sizes than logic cells, but it’s not like traditional logic gates where there’s never a contention and you’re always writing a new value in,” said Flex Logix CTO Cheng Wang. “You have to overcome the current value. But when you’re not writing to it, you want it to hold its value very strongly. So you have a dilemma, where it cannot be too weak to hold its value during normal operation, but when you write to it you want it to be weaker. Because SRAM only has six transistors, there’s not a whole lot of gates that you can add to let it be weaker when you’re writing, and amplified when you’re not writing. You also cannot make the SRAMs too small, because that could cause single-event upsets (SEUs) from issues such as alpha particles, where the energy of the ions overwhelms the energy in the SRAM cell, which happens more as SRAM shrinks.”
Consequently, error correction likely will become a common requirement, especially for automotive devices, according to Wang.
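As a minimal illustration of how error correction recovers from a single flipped bit, the sketch below implements the textbook Hamming(7,4) code. This is a stand-in chosen for brevity, not the scheme Flex Logix or any particular SRAM macro uses; production memories generally protect wider words with stronger codes. The function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
upset = list(stored)
upset[4] ^= 1  # simulate a single-event upset at position 5
recovered, position = hamming74_decode(upset)
print(recovered, "corrected bit at position", position)
```

The cost is visible in the encoding itself: three extra bits are stored for every four bits of data, which is part of why error correction adds area on top of an SRAM that is already hard to shrink.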
SEUs have become such a problem at lower nodes that radiation hardening techniques, previously used only for mil/aero applications, are being used for SRAMs at N5 and below, said Tate. However, because rad-hardening can add 25% to 50% to the cost, it’s only likely to be used for devices such as pacemakers, where no one can afford to wait for a reboot.
“It may be that in 10 years, everything has to be designed rad-hard. You just can’t keep making the memory elements smaller and smaller,” said Tate. “We’re not getting rid of alpha particles.”
Basic approaches: Tradeoffs
This is causing a lot of changes on the design side. “Everyone’s trying to use fewer SRAMs on chip, because they’re not getting smaller,” Wang said. “But you use SRAMs for their bandwidth, so as long as the bandwidth is there. As your chips get bigger, high-capacity bandwidth memory will get pushed off-chip to DDR, but you will still have chunks of smaller, high-bandwidth memories that are SRAM.”
Another approach designers take is to use only single-port memories where possible. “In the older process nodes, there’s much more likelihood of doing dual-port memory when we write register files,” Wang said. “But all those add area, as well. So in lower nodes, designers try to make everything operate from a single port in memory, because those are the smallest, densest full-power options available. They’re not necessarily moving away from SRAM, but they’re trying to use single-port memory whenever possible. They’re trying to use smaller memories, and choose SRAM for the available bandwidth, not really for large storage. Large storage either gets moved to DRAM, if you can afford the latency, or it gets moved to HBM, if you can afford the cost.”
Alternative approaches: New architectures
To keep improving the power and performance of SRAM, numerous updates that extend beyond the bit cell design have been evaluated and applied, including additional support circuits in the SRAM periphery design, according to Yun.
“The SRAM and periphery no longer share their power. Instead, a dual power rail is employed to individually utilize the most efficient voltage level,” said Siemens’ Yun. “In certain designs, SRAM can enter a sleep mode, applying bare minimum voltage needed to retain data until its next access from the CPU. This presents significant power benefits, since leakage current is exponentially correlated with VDD. Some SRAM designs incorporate extra circuits to address operational weak points, aiming to improve the minimum operating voltage.”
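To show why a lower retention voltage pays off so quickly, the sketch below assumes a simple exponential leakage model, I = I0 · exp(VDD / V_slope), and multiplies by VDD to get power. The model form, the constants `I0` and `V_SLOPE`, and both voltage levels are assumptions chosen for illustration. Real leakage depends on the process, temperature, and device type.

```python
import math

# All constants are hypothetical placeholders, not process data.
I0 = 1.0          # arbitrary current unit
V_SLOPE = 0.2     # volts; sets how steeply leakage rises with VDD
NOMINAL_VDD = 0.75
RETENTION_VDD = 0.45


def leakage_power(vdd, i0=I0, v_slope=V_SLOPE):
    """Static power under the assumed model: P = VDD * I0 * exp(VDD / v_slope)."""
    return vdd * i0 * math.exp(vdd / v_slope)


ratio = leakage_power(NOMINAL_VDD) / leakage_power(RETENTION_VDD)
print(f"sleep mode cuts leakage power by about {ratio:.1f}x in this model")
```

Because the voltage appears both as a linear factor and inside the exponent, even a modest drop to a retention level removes a disproportionate share of the static power under these assumptions.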
As an example, high-density (HD) SRAM cells can achieve the smallest geometry by using a single-fin transistor for all 6 transistors. However, the HD cell faces challenges in low-voltage operation due to contention issues between same-size pull-up (PU) and pass-gate (PG) transistors during write operations.
“SRAM assist circuits, such as negative bit-line and transient voltage collapse techniques, are widely adopted to alleviate these issues and enhance lower voltage operation,” said Yun. “To mitigate the parasitic resistance effect, the latest bit cell designs utilize double- or triple-track metal wires as merged bitline (BL) or wordline (WL). The flying BL method selectively connects between metal tracks based on operation, lowering effective resistance and balancing discharge rates between the top and bottom of the array. In ongoing development, a buried power rail is being explored to further reduce wiring resistance. This involves placing all power rails beneath the transistors, alleviating signal path congestion above the transistor.”
Other memories, other structures
New embedded memory types are often brought up as SRAM replacements, but each has its own set of issues. “The leading contenders, MRAM and ReRAM, take just one single transistor area,” said Yun. “While it’s larger than the transistors in an SRAM, their overall cell size is still about one-third of SRAM, with a finished macro-size target, including peripheral circuit, that is about half the size of an SRAM. There is a clear size advantage, but the performance of the write speed is still far slower than the SRAM. There are a couple of promising write speed and endurance achievements in labs, but the high-speed MRAM development schedule has been extended subsequent to the production of eflash replacement MRAM for automotive. The size advantage for L3 cache replacement is certainly worth considering, but there must be a preceding ramp-up in the production of the eflash type of MRAM.”
If physics won’t permit smaller SRAM, the alternatives will require rethinking architectures and embracing chiplets. “If SRAM does not scale in N3 or N2, then one could combine a more advanced die for the logic with an SRAM die fabricated in an older technology,” said imec’s Hellings. “Such an approach would benefit from improved PPA for the logic while using a cost-effective (older, probably higher yield and less expensive) technology node for the SRAM. In principle, AMD’s V-cache-based systems could see an extension where only the logic die is moved to the next node. Both dies then need to be combined using 3D integration or a chiplet approach (2.5D).”
Scott Hanson, CTO of Ambiq, noted that a chiplet solution fits right into the ongoing integration revolution. “Analog circuits stopped scaling long ago and, with a few exceptions, don’t benefit greatly from scaling. Memories of all types, from DRAM to SRAM to NVM, prefer to be manufactured at different nodes for power, performance, and cost reasons. Logic prefers to be manufactured at the smallest node that still meets cost and leakage requirements. With multi-die integration, we manufacture each circuit in the ‘ideal’ technology node and then combine the dies into a single package. Many have heard about this in the mobile and data center space, but it’s also happening quickly in the endpoint AI and IoT space.”
In limited circumstances, system technology co-optimization (STCO) also could help. “For some applications, on-chip cache is in principle not needed,” said Hellings. “For example, in AI training, the training data is used only once, while model parameters should be readily accessible on-chip. Software and chip architecture hooks that facilitate such one-time data movements, bypassing the cache hierarchy, have a lot of potential.”
All of this has sparked an interest in new layouts and interconnect protocols, such as UCIe and CXL. “Memory scales with compute when you have a larger AI workload, but if one of those components is scaling a little bit faster than the other, you get different bottlenecks according to how the system is designed,” said Ron Lowman, strategic marketing manager at Synopsys. “The AI workloads have dramatically increased the number of arrays of processors required. They’ve even pushed the limits of the reticle size of dies, so now you have a need for high-speed interconnects like UCIe for die-to-die systems, which means that multi-die systems are inevitable to handle the AI workloads.”
A new stack to solve the problem
Winbond has re-thought the memory architecture with its CUBE stack (customized ultra-bandwidth elements). “We are using DRAM as a memory cell, but also doing 3D stacking through vias,” explained Omar Ma, DRAM marketing manager for Winbond. “Basically, you can provide a connection from the bottom substrate all the way through the SoC die. It’s more cost-effective because DRAM does not use the six transistors of SRAM.”
CUBE can provide high enough density to replace SRAM up through Level 3 cache. “In order to reach certain bandwidth requirements, there are only two choices: increase the clock speed or increase the number of I/Os,” Ma explained. “With CUBE, you can increase them as much as you want, and at the same time reduce the clock. That brings a lot of benefits at the system level, including reduced need for power.” CUBE is currently in prototype, but is anticipated to be in production in Q4 2024 or early 2025.
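Ma's two choices follow from the fact that peak bandwidth is the product of I/O count and per-pin data rate. The sketch below shows two configurations that reach the same bandwidth, one narrow and fast, one wide and slow. The I/O counts and data rates are invented for illustration and are not CUBE specifications; the function name is hypothetical.

```python
def bandwidth_gbytes_per_s(num_ios, gbits_per_s_per_io):
    """Peak bandwidth in GB/s: I/O count times per-pin rate, 8 bits per byte."""
    return num_ios * gbits_per_s_per_io / 8


# Hypothetical narrow, fast interface versus a wide, slow one.
narrow_fast = bandwidth_gbytes_per_s(64, 8.0)
wide_slow = bandwidth_gbytes_per_s(1024, 0.5)

print(narrow_fast, wide_slow)  # 64.0 64.0
```

A stacked interface with many short connections makes the wide, slow option practical, which is where the power benefit Ma describes comes from.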
Conclusion
In the short term, pragmatism likely will win out over drastic design changes. “This is going to be incremental,” said Flex Logix’s Tate. “It’s not going to be dramatic. When designers are talking about how big a cache they should have, it’s going to be a balance, as always, between performance and price. If the price of SRAM goes up, they’ll decide they can live with less than they used to, and they’ll pay for it with some performance loss somewhere else. Maybe they’ll make it up by having more DRAM bandwidth. For now, it’ll be those kinds of incremental tradeoffs. You’re not going to see radically different architectures anytime soon. But if the trend continues, that will cause people to think about radically different approaches.”
As for SRAM being completely replaced, that seems unlikely, at least in the near term. “A couple of years ago, Intel demonstrated using a ferroelectric memory for a cache,” said Jim Handy, general director of Objective Analysis. “They said it was a DRAM, but in all honesty, it was an FRAM. They said the advantage was they were able to use 3D NAND technology to get it really tight. In other words, they showed a tiny space that had an awful lot of memory. It’s very likely that one of those types of research efforts — either like what Intel demonstrated or another approach like MRAM — will wind up taking SRAM’s place, but it’s probably not going to happen anytime soon.”
When it does happen, Handy expects it to result in both architecture and OS software changes. “It’s unlikely that you’d see the same processor with both an SRAM cache and a ferroelectric cache because the software is going to have to undergo some changes to be able to take advantage of that,” he said. “In addition, the cache is going to be structured a little bit differently. It’s likely that the primary cache will shrink a little bit and the secondary cache will get phenomenally large. At some point, the last processor to have an SRAM cache will come out. The next processor will have a ferroelectric or an MRAM cache, or something like that, along with substantial changes to the software to make that configuration work better.”