The Uncertain Future Of In-Memory Compute

The answer may depend on whether SRAM can shrink further.


Experts at the Table — Part 2: Semiconductor Engineering sat down to talk about AI and the latest issues in SRAM with Tony Chan Carusone, chief technology officer at Alphawave Semi; Steve Roddy, chief marketing officer at Quadric; and Jongsin Yun, memory technologist at Siemens EDA. What follows are excerpts of that conversation. Part one of this conversation can be found here and part 3 is here.

L-R: Alphawave’s Chan Carusone; Quadric’s Roddy; Siemens EDA’s Yun.

SE: Will people be working on trying to shrink the size of SRAM, and is that even possible given the laws of physics?

Chan Carusone: There’s always been a promise that there will be improvement, but the trend has been that SRAM scales more slowly than logic. That means either the architecture changes or the SRAM consumes a larger and larger fraction of a given chip. We may see both. You can change the hierarchy, change the location, or change the types of memory you use to prevent bottlenecks. That’s one solution. But there’s also going to be lots of R&D on technology, like bottom-up solutions to squeeze SRAM.

Roddy: There’s only so much you can do with changing core technologies. In standard SoCs, it’s classic 6T SRAM, which it has been for 30 to 40 years. People have tried a variety of things like 3T cells, but there are reliability, manufacturability, and designability questions, such as how you test with it. Certain markets, like automotive, freak out when you get cells that are much more error prone or alpha-particle sensitive. Maybe you can’t do a giant inference machine in a car, where you have reliability and functional safety problems. All these things are going to have to be taken into consideration. It’s the variety of layers of memory that becomes the toolkit the architect needs to use.

SE: Why is SRAM shrinking more slowly than logic?

Yun: The SRAM shrink lags behind the logic shrink primarily because of the tight design rules in the latest technology nodes. In the past, we had separate design rules for SRAM, which allowed us to shrink it more than the logic transistor-based design. However, as we moved to smaller nodes, it became increasingly challenging to maintain this distinction. Now, SRAM follows more and more of the logic design rules, so there is less benefit from shrinking the memory further than the logic. In addition, the size of the memory cell is important because this design is repeated millions of times on a chip, which affects the cost of the chip. However, when we migrated to smaller technology nodes in recent years, the benefits were diluted because we would end up spending more money on the migration than we gained from shrinking the SRAM. This is the main challenge we face when striving to increase SRAM density in AI chips.

SE: Shrinking transistor sizes can lead to leakage currents. How are people coping with it?

Yun: The major leakage benefit from technology migration came from reducing the VDD level and incorporating new materials, such as a high-k material in the transistor oxide to reduce gate leakage, which improves power efficiency. However, VDD scaling has hit the saturation point near the 0.7V to 0.8V range, meaning we no longer can gain additional benefits from voltage reduction, and other leakage levels also remain relatively unchanged. If we continue to increase the density of SRAM and keep migrating to newer technology to add more transistors to a chip, we need more power for the chip’s operation. For example, Lisa Su, CEO of AMD, forecast that we will use half of a nuclear power plant’s energy just for a single supercomputer operation in 2035. That’s a lot of energy, and we are heading in an unsustainable direction. Something needs to happen to reduce the energy used in a chip. The recent AMD chip reduced the logic area and filled it with more cores, while keeping the memory density the same. Reducing the logic area may reduce the logic operating frequency. However, it also allows the system to accomplish a similar workload through the extra throughput from doubling the number of cores, resulting in a modest tradeoff, but a substantial gain in power efficiency.

SE: What role will SRAM play in near-memory compute or in-memory compute? Will we see in-memory compute in the commercial market?

Roddy: There have been a couple of attempts by chip startups to commercialize analog in-memory compute, particularly for the multiplication functions. In machine learning, it’s a lot of matrix multiplication and convolutions. It’s easy to conceptualize by talking about an image. A 3 x 3 convolution is doing a compute around a pixel’s nearest neighbors. Thus, with 1 x 1, 3 x 3, 9 x 9, you grow the aperture of what you’re trying to compute. It lends itself nicely to the idea that you could do it in a memory cell. With analog, you’d have instant access and you can integrate voltage, etc. But practically speaking, none of those things have come to life. There’s been a lot of venture money, hundreds of millions of dollars, poured into solutions that have never seen the light of day, primarily because it becomes a partitioning problem. If you say, ‘I’m going to build some sort of odd, non-digital compute in the memory itself,’ by definition you have said, ‘I’m going to carve out a separate chip with separate technology, and my general compute engine is going to run on a pure digital chip and some other engine is going to be over in this memory chip.’ Now you have a very hard partitioning of the algorithm, which creates strong limitations. You have to have this Goldilocks network, where compute can stay local to the analog chip before it has to go back to the general-purpose chip where the main code is to finish execution.

If you have a cell phone with an app processor from Qualcomm, a separate chip from somebody else, and you want to run a face beautification algorithm live during a Zoom call, how do you do that? How do you synchronize the execution of the software? Algorithms get more complicated on an annual basis, and data scientists haven’t slowed down their innovation. Transformers, which are the new thing, like ChatGPT and vision transformers, are incredibly complex with the amount of transfer back and forth. Let’s say you had deployed something with in-memory compute for convolutions. You’d never map a vision transformer to that because you’d run out of time waiting for data to go back and forth between the two different types of chips. So, pure in-memory computing in SRAM with something different, esoteric and partitioned? That’s never going to happen. If you could build it as a compiled SRAM that could be on the SoC, now you’re talking about something different, but now you’ve got to have a 6T analog cell with some sort of other analog thing built in. It has to be isolated from noise from the big GPU shader engine you’re putting down right next to it, so that becomes a problem, too. How do you build the silicon such that the 10,000 MACs banging away over here don’t inject noise into the sensitive analog circuit that you’ve tried to compile? It seems to be an untenable problem. About $300 million of venture capital has gone down the drain, and no one’s managed to make anything in production quantities yet.
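To make the convolution Roddy describes concrete, here is a minimal Python sketch of a 3 x 3 kernel sliding over a small image. It is an illustration only, not any vendor's implementation: the function name `conv2d_valid`, the 4 x 4 test image, and the all-ones kernel are assumptions made for this example. Each output value is one pixel's neighborhood reduced by multiply-accumulate (MAC) operations, which is the work that in-memory compute proposals try to perform inside the memory array.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding; return (output, MAC count).

    As is common in machine learning, the kernel is not flipped, so this is
    strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    macs = 0
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            # One output value = kh * kw multiply-accumulates over the
            # pixel neighborhood covered by the kernel.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
                    macs += 1
            row.append(acc)
        output.append(row)
    return output, macs


# Hypothetical 4 x 4 image and an all-ones 3 x 3 kernel (a simple box sum).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box_kernel = [[1] * 3 for _ in range(3)]

result, mac_count = conv2d_valid(image, box_kernel)
print(result)     # 2 x 2 output
print(mac_count)  # 4 outputs * 9 MACs each
```

Growing the kernel from 3 x 3 to 9 x 9 raises the MACs per output value from 9 to 81, which is the widening aperture mentioned in the discussion.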

Chan Carusone: Because of the restrictions that Steve described, most of the rational interest is focused on using that kind of technology for some low-power or niche edge inference kind of applications. I wouldn’t doubt the potential impact. But now the key is finding an application for this technology where there’s enough volume, enough market potential to justify this quite tailored hardware solution for it. That’s why we’re seeing this idea hanging around for a long time, but are still waiting for the opportunity for it to have a big impact.

Yun: I agree. Many new technologies won’t see development unless there is substantial demand from the market. Even if we have a promising technology ready to deploy, it won’t be put into action until we’ve addressed all the risks and there is demand that will generate revenue. In the case of compute-in-memory (CiM), we can reduce the data transfer because all calculation happens in the same location. The data stays there and is processed without being moved. This translates to faster processing and better energy efficiency. However, to make this work, we will need to make various adjustments to the surrounding system to accommodate these new ways of handling data. To justify investing in such changes, there must be strong demand. Also, there shouldn’t be an alternative solution with lower risk available, so we can confidently start working on it.

An example of moving one step in that direction utilizes DRAM. When we have a lot of cores connected in parallel in an AI chip, we need to bring high-bandwidth data to the processor to improve efficiency. So designers added a lot of DRAM connected to the AI chip to deliver the massive amount of data. DRAM was chosen because it is more cost-effective to store the data in DRAM than in SRAM. Whenever the number of cores is increased, the number of DRAM channels has to increase as well. And now we have thousands of cores in some AI chips. That demands more and more DRAM channels, which easily reaches several hundred pins, but it is physically impossible to wire an unlimited number of channels into a single chip. We have to resolve this bottleneck. Samsung suggested performing near-memory calculations within the DRAM itself. They’ve added a MAC unit in the DRAM to perform the initial calculation. After that, the data is sent to the AI chip for the follow-up step. This approach is one step closer to near-memory computing. People look for a new solution when they see a dead end. If there is a way to relax the issue with an existing configuration, people tend to stay with the existing setup to avoid the risks of moving to a new method.
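The data-movement argument behind a MAC unit placed next to the memory can be sketched with a toy matrix-vector product. This is a conceptual model under stated assumptions, not a description of Samsung's design: the functions `host_side` and `near_memory`, the "words moved" metric, and the 4 x 8 weight matrix are all hypothetical and exist only to count how many values would cross the memory interface in each arrangement.

```python
def matvec(weights, x):
    """Plain matrix-vector product: one multiply-accumulate chain per row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def host_side(weights, x):
    """Baseline: every stored weight crosses the memory interface to the
    processor, which then performs all the MACs."""
    words_moved = sum(len(row) for row in weights)
    return matvec(weights, x), words_moved


def near_memory(weights, x):
    """Hypothetical MAC unit beside the memory array: the input vector is
    sent in once, and only one accumulated sum per row is sent back."""
    words_moved = len(x) + len(weights)
    return matvec(weights, x), words_moved


# Hypothetical 4 x 8 weight matrix held in memory and an 8-element input.
weights = [[r + c for c in range(8)] for r in range(4)]
x = [1] * 8

baseline_result, baseline_words = host_side(weights, x)
nmc_result, nmc_words = near_memory(weights, x)

print(baseline_result == nmc_result)  # same answer either way
print(baseline_words, nmc_words)      # values crossing the interface
```

In this toy model the traffic drops from rows x columns words to rows + columns words. A real device would also have to account for the follow-up processing on the AI chip, data formats, and scheduling, none of which are modeled here.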

Chan Carusone: The DRAM memory bottleneck is a key challenge to be addressed, and that’s why HBM is increasing in importance. There’s a good roadmap ahead for HBM to provide higher bandwidth memory interfaces. People also are talking about the potential to essentially use some kind of chiplet that’s an HBM-to-DDR translator. That might introduce another level of hierarchy in the memory, where you’ve got some HBM and maybe some DDR that’s a little further out. People are thinking about coming at this memory bottleneck problem in all different ways.

Roddy: People are even trying to tackle the memory bottleneck problem numerically at the data science level. Training obviously is super expensive, and if you want to train your 100-billion-parameter chatbot, that’s millions of dollars of compute time on your favorite cloud service. Folks have experimented with whether, if their calculations in training are floating point 32 (fp32), they can store out to DDR in a different format. You’ve got bfloat, fp8, and a variety of things they try to figure out. The simplest one for me a couple of years back was bfloat. You literally take an fp32 number, cut off the lower 16 bits of the mantissa, and just throw them away, saying, ‘I don’t really need it. I’ll gain it back when I bring it back to train the next time.’ Quite simply, it’s like, ‘How do I cut the DDR traffic in half and speed up the overall training problem?’ That’s evolved into a whole bunch of other sorts of numerical issues, like fp8 with various esoteric formats. They’re all trying to meet the challenge of a firehose of data (zillions of images for full self-driving or language samples, or whatever it happens to be) that has to be moved from compute to memory, memory to compute. It’s a memory bottleneck and memory hierarchy problem, not a compute problem at this point.
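The bfloat truncation Roddy describes can be sketched in a few lines of Python using only the standard `struct` module. This is a minimal illustration under stated assumptions: the helper names `fp32_to_bf16_bits` and `bf16_bits_to_fp32` and the sample values are made up for this example, and the code uses plain truncation as described in the conversation, whereas production hardware and libraries often round to nearest instead.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the IEEE-754 fp32 encoding: the sign bit,
    all 8 exponent bits, and the upper 7 mantissa bits. The lower 16
    mantissa bits are discarded (truncation, no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_fp32(b: int) -> float:
    """Re-expand a 16-bit pattern to fp32 by zero-filling the dropped bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


# Hypothetical values standing in for training data written out to DDR.
values = [1.0, 3.14159265, -0.001, 65504.0]
stored = [fp32_to_bf16_bits(v) for v in values]
restored = [bf16_bits_to_fp32(b) for b in stored]

bytes_fp32 = 4 * len(values)  # 4 bytes per fp32 value
bytes_bf16 = 2 * len(values)  # 2 bytes per truncated value

for original, back in zip(values, restored):
    print(original, "->", back)
print(bytes_fp32, "bytes as fp32 vs", bytes_bf16, "bytes after truncation")
```

Because the exponent field is untouched, the dynamic range of fp32 is preserved while the stored size is halved. The cost is precision: only 7 explicit mantissa bits survive, so each value is accurate to roughly two or three decimal digits.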

