Is In-Memory Compute Still Alive?

It hasn’t achieved commercial success, but there is still plenty of development happening; analog IMC is getting a second chance.

In-memory computing (IMC) has had a rough go, with the most visible attempt at commercialization falling short. And while some companies have pivoted to digital and others have outright abandoned the technology, developers are still trying to make analog IMC a success.

There is disagreement regarding the benefits of IMC (also called compute-in-memory, or CIM). Some say it’s all about reducing data movement, a huge component of AI energy consumption. “It’s easy to put the MACs [multiply/accumulate circuits] down,” said Gordon Cooper, product manager for ARC AI processors at Synopsys. “It’s a lot harder to feed them and make sure data is flowing through them efficiently.”

Other companies focus on computing power, and each believes its approach solves the more important of those two problems, if not both. However, optimized circuits that can both reduce data movement and perform low-power AI calculations, while balancing cost and manufacturability, remain elusive.

“Data movement is the key problem, both for performance and for power,” said Steven Woo, fellow and distinguished inventor at Rambus. “There’s no shortage of data in the world, and especially with these large AI models, the training sets are enormous.”

The best route to address this isn’t clear, but IMC is one possible option. To anyone unaware of the stealth startup efforts underway, the topic would appear to have retreated to research labs, where much work continues. “I feel like we’re not quite out of the research phase,” said Frank Ferro, product marketing group director at Cadence.

In fact, IMC isn’t even on the radar for most designers. “We don’t see it at our customers’ sites,” said Nigel Drego, co-founder and CTO at Quadric.

Nevertheless, new offerings and approaches are attempting to change this.

More than one meaning
The term “in-memory compute” and its variants mean different things to different companies. When the motivation is keeping data movement to a minimum, it’s closely related to the notion of “at-memory” or “near-memory” computing. In those cases, small blocks of SRAM sit near where computing happens. Using such memory still requires data movement, but the distance is short compared with saving to DRAM.

The notion of “in memory” takes this farther and turns it on its head. Those prior approaches are about putting memory near where the computing is. IMC is more about putting computing where the memory is. One big distinction between flavors of IMC is whether the computing happens inside the memory array but outside the memory cells, or whether those cells perform the computing themselves.

Another distinction is the nature of the computing — digital or analog. Digital IMC tends to be of the type with a few digital gates sprinkled liberally throughout the array. “You interleave memory cells with a computing element that does the multiplication and another that does the accumulation, and you surround all of this with plenty of other digital logic to do all the other operations,” said Fabrizio del Maffeo, CEO of Axelera.

The idea behind digital IMC isn’t complicated — it merely moves digital arithmetic circuits from one place to another. That’s not to suggest that it’s easy, however. Building efficient circuits and tools still requires plenty of work. It’s just not as fraught as analog is.
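
For readers who want to see the premise in code, here’s a minimal behavioral sketch in Python. It is illustrative only, not any vendor’s design: multiply/accumulate logic is distributed among the memory rows, so each row’s weights are consumed in place instead of crossing a bus to a central MAC array.

```python
# Behavioral sketch of digital IMC (illustrative, not a vendor design):
# MAC logic sits next to the rows of stored weights, so weights are
# consumed where they live. Only the much smaller results leave the array.
import numpy as np

def digital_imc_matvec(weights, activations):
    """weights: one vector per memory row; activations broadcast to all rows."""
    partial_sums = []
    for row in weights:                  # each iteration models one row's local MAC
        acc = 0
        for w, a in zip(row, activations):
            acc += w * a                 # multiply and accumulate next to the cells
        partial_sums.append(acc)
    return np.array(partial_sums)

weights = np.random.randint(-8, 8, size=(4, 16))      # small integer weights
activations = np.random.randint(-8, 8, size=16)
assert np.array_equal(digital_imc_matvec(weights, activations),
                      weights @ activations)           # matches a plain matvec
```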

Analog computing is typically performed by treating the memory cell as having variable contents that can be measured by sensing the current flowing through it. Word lines take on continuous, real-valued voltages rather than binary levels. In the best-known type of implementation, each cell at the intersection of a word line and a bit line effectively multiplies the input voltage by the cell conductance, which is set by the stored cell value. With multiple word lines active at the same time, the currents from those multiplications sum on the bit line, yielding a sum of products. All cells on a bit line can perform their multiplications in parallel.

“[A flash IMC solution] is based on putting hundreds of millions of flash cells on a chip so that we can do all of the work in situ,” said Richard Terrill, vice president of strategy and business development at Sagence.

Fig. 1: Classic flash-based IMC architecture. Digital inputs are converted to analog voltages on word lines. Unlike a memory, multiple word lines can be active at the same time. Each cell along a bit line multiplies its input voltage by the flash-cell conductance, which is established by the stored weight, and contributes a proportional current. The sense amp sums all those currents to provide accumulation. Results must then be digitized and sent through other circuits such as activation functions before either being routed back for another layer or sent out as a complete result. Source: Bryon Moyer/Semiconductor Engineering
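
An idealized numerical model of the Figure 1 scheme makes the physics concrete: Ohm’s law does the multiplies, and Kirchhoff’s current law does the accumulation. The conductance and voltage values below are illustrative only.

```python
# Idealized model of the Figure 1 array: each bit-line current is the sum,
# over active word lines, of (input voltage x cell conductance).
# All values are illustrative.
import numpy as np

G = np.array([[1e-6, 3e-6],    # cell conductances in siemens, one column per
              [2e-6, 1e-6],    # bit line, programmed from the stored weights
              [4e-6, 2e-6]])
V = np.array([0.2, 0.5, 0.1])  # word-line voltages encoding the digital inputs

I_bitline = V @ G              # one current per bit line: an analog dot product
print(I_bitline)               # [1.6e-06 1.3e-06] amps, then digitized by an ADC
```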

One limitation of this technique is that it requires integer data. That’s natural for vision, but attention-based networks such as large language models (LLMs) more often employ floating-point data, putting such applications out of reach for this architecture.

What problem are we solving?
The problems that IMC addresses are not well-defined or agreed upon. While all concur that lower power is the ultimate goal, what’s in dispute is whether the main issue is the cost of moving data or the cost of computing.

Digital approaches tend to focus on the cost of moving data since computing power will be more or less the same, whether the digital circuits are inside or outside the memory array. This attempts to address memory bandwidth issues. “We are getting our heads beat every day by hyperscalers needing more bandwidth,” observed Cadence’s Ferro.

Sharad Chole, chief scientist and co-founder at Expedera, agreed. “The bottleneck is no longer compute or the memory, but the bandwidth between compute and memory,” he said.

With digital IMC, nothing significant changes about how the computing is performed. “The fundamental technology you have still looks the same as a digital accelerator,” said Naveen Verma, CEO of EnCharge AI. “By inserting [adders] inside memory, all you’ve really done is blown the memory up, and the energy is the same as what you would have had had you done it outside the memory. The benefits are incremental compared to standard digital computing.”

It also may be that there’s no one right answer. Instead, it could depend on the nature of the model being executed, particularly with LLMs. “If your context length is small, like 256 tokens, then the weights are dominant,” Expedera’s Chole explained. “But if you’re generating 32,000 tokens, then activations start becoming the important part. If your activation movements dominate the power, then the benefit from storing the weights in the analog domain isn’t going to offset that.”
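
A back-of-the-envelope calculation illustrates the crossover Chole describes. Both constants below are assumptions chosen for illustration, not figures from Expedera: a 7B-parameter model stored at INT8, and a 0.5MB key/value footprint per token of context.

```python
# Back-of-envelope: weight traffic is roughly fixed per generated token,
# while attention (KV-cache) traffic grows with context length.
WEIGHT_BYTES = 7e9            # assumed: 7B-parameter model at INT8
KV_BYTES_PER_TOKEN = 0.5e6    # assumed: KV-cache bytes read per token of context

for context_tokens in (256, 32_000):
    kv_traffic = context_tokens * KV_BYTES_PER_TOKEN
    ratio = kv_traffic / WEIGHT_BYTES
    print(f"context={context_tokens:>6}: activation/weight traffic = {ratio:.2f}")

# context=   256: activation/weight traffic = 0.02  -> weights dominate
# context= 32000: activation/weight traffic = 2.29  -> activations dominate
```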

Another aspect of the debate is whether it’s practical to fully populate a memory with all the weights it will need so that no further movement is necessary. Flash-based approaches claim the benefit of non-volatility, so weights remain in place even after a power cycle. But that means the device must be sized for the largest model it will encounter, and any excess capacity wastes silicon, although that analysis assumes only one model in a design. “We actually end up storing multiple models,” said Sagence’s Terrill.

Others suggest that it’s impractical or even inadvisable to house entire models, opting instead for procedures that update the weights during processing. “It has been widely shown in IMC research that weights cannot be stored permanently in the memory,” noted Verma. “The reason is that different bits of data are involved in very different numbers of operations, and if each bit is allocated one memory cell, some of the memory cells would do lots of operations and some would stay mostly idle, leading to low hardware utilization.”

If that view is correct, IMC cannot solve the weight-movement problem as well as hoped. It also makes non-volatile memory impractical because programming times are roughly three orders of magnitude longer than the time necessary for rewriting SRAM. But SRAM is a large, power-hungry cell, further complicating the tradeoffs.

This debate won’t be resolved until the chips embodying these approaches have a chance to prove themselves in the field. For now, there is no clear right answer.

Analog’s challenges
Analog IMC isn’t new. Mythic was closely watched as it attempted — and ultimately failed — to bring a flash-based analog IMC inference engine to market. It promised both lower compute power and reduced data movement due to its use of flash for weight storage. It’s not known specifically what doomed Mythic’s project, but the technology brings some significant challenges, and analog requires difficult tradeoffs. “It’s power, speed, or accuracy with analog,” said Drego. “Pick two.”

In a classic implementation, each flash cell holds an entire weight. INT8 is one of the more popular data formats for vision and convolutional neural networks (CNNs), but holding an eight-bit value in a single flash cell is a tall order. And effective accuracy can still suffer. “I have not heard of anything getting above, say, four-bit effective accuracy,” added Drego. “But there are some niche applications where this stuff can be very, very efficient.”

Fig. 2: Shrinking read windows for multi-bit cells. The more bits a cell contains, the finer the divisions and the more sensitive the read mechanism must be. Source: Bryon Moyer/Semiconductor Engineering

Commercial flash cells holding three bits have been around for years. Four-bit cells are now a reality, and five-bit cells are emerging. But no one has eight-bit cells. That would require extreme care, especially if it is to work across multiple cells, dies, wafers, and lots, as well as across all environmental conditions and after aging. The reality is that one might have to settle for less precision, limiting the technology’s utility.
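
The arithmetic behind that difficulty is simple. Assuming a fixed overall read window — the 600mV figure below is purely illustrative — the spacing between adjacent levels shrinks exponentially with the number of bits per cell.

```python
# With a fixed read window, the gap between adjacent levels shrinks as
# 1/(2^bits - 1). The 600 mV window is an assumed, illustrative figure.
WINDOW_MV = 600.0

for bits in (1, 3, 4, 5, 8):
    levels = 2 ** bits
    spacing_mv = WINDOW_MV / (levels - 1)   # gap between adjacent stored levels
    print(f"{bits}-bit cell: {levels:>3} levels, ~{spacing_mv:.1f} mV apart")

# An 8-bit cell must discriminate levels about 2.4 mV apart -- across cells,
# dies, wafers, lots, temperature, and aging.
```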

The aging concern is one that prospective customers have been raising. “When I ask potential customers about analog, aging is one of the things they’re not sure how to handle, and it tends to scare them away,” said Paul Karazuba, vice president of marketing at Expedera.

The worry is that as the cells age, their operation will shift, leading to potential hallucinations in an effect we might call “silicon senility.” In reality, such aging chips aren’t likely to hallucinate, which involves giving a potentially cogent but wrong answer. One is more likely to get gibberish, but that’s still unacceptable.

Aging aside, manufacturing and environmental variations must somehow be canceled out so that all chips work identically. That has proven extremely challenging, and this issue likely contributed to dooming past efforts. Even without variation, analog by definition lacks the noise margin that digital enjoys. “The problem with analog has always been noise,” observed Verma.

Yet another challenge is that analog computation is only part of an accelerator’s operation. Other functions, such as softmax or nonlinear activation functions, must happen in the digital domain. That means that after each layer calculates its matrix products in analog, the results must be converted to digital to generate activations, which must then be converted back to analog for the next layer. “You wind up with a terrible mess of the activations flowing back and forth,” noted Steve Roddy, chief marketing officer at Quadric.

Maintaining precision through all those conversions requires accurate DACs and ADCs, and those circuits can consume a lot of energy, counteracting one of the main benefits of the architecture.

A final challenge may be cost, although that remains to be seen and depends on the memory technology being implemented. “If you’re building a standalone chip with a variant of flash or DDR, your costs are going to be a lot higher than Micron, Hynix, and Samsung, which are cranking out bazillions,” Drego said.

These issues notwithstanding, a new startup called Sagence (known as Analog Inference while in stealth) has revealed a new analog IMC offering — one that looks surprisingly similar on the outside to Mythic’s approach. The other newcomer, EnCharge AI, hasn’t formally launched yet, and it has a unique sensing technology.

One more time, with feeling
If you squint and look at Sagence’s technology, you’d say it resembles what Mythic was doing. Flash array? Check. Calibration to handle variations? Check. Multi-bit flash cells? Check. Summing in the sense amp? Check. It’s not clear what kind of flash cell Mythic employed, but Sagence says it uses standard flash cells that it has licensed in a NOR configuration. All circuits adapting the flash array to inference lie outside the array, so cell area efficiency is 4F².

The main difference from prior implementations is that the company operates the flash array in the deep subthreshold regime. This cuts power by orders of magnitude, with currents measured in fractions of a nanoamp. But it creates a challenge in that the math is no longer linear, so the Ohm’s Law approach illustrated in Figure 1 no longer works.

Sagence deals with this by storing weights logarithmically. That means the levels in the flash cell aren’t evenly divided; they get closer together at higher data values. The bottom sections may be easier to detect than in the linear version, but the top ones will be harder to discriminate. To make things even more difficult, the overall read window is smaller when operating in deep subthreshold.

Fig. 3: The difference between linear and logarithmic “spacing” between levels. On the left, a linear approach creates equal spaces. On the right, the spacing gets narrower as you move to higher values. (Logarithmic divisions not drawn to scale.) Source: Bryon Moyer/Semiconductor Engineering
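
A small sketch shows the contrast, assuming a four-level cell and a normalized read window. Both choices, and the particular compressive curve, are illustrative; Sagence hasn’t published its actual level map.

```python
# Linear vs. logarithmic level placement for a 4-level cell in a
# normalized read window. The log curve here is illustrative only.
import numpy as np

levels = 4
linear = np.linspace(0.0, 1.0, levels)                     # equal spacing
log = np.log2(1 + np.arange(levels)) / np.log2(levels)     # compressive spacing

print("linear levels:", np.round(linear, 3))   # [0.    0.333 0.667 1.   ]
print("log levels:   ", np.round(log, 3))      # [0.    0.5   0.792 1.   ]
print("top-end gap:  ", round(linear[-1] - linear[-2], 3),
      "vs", round(log[-1] - log[-2], 3))       # 0.333 vs 0.208: harder to read
```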

Sagence agrees that the conversions between digital and analog must be precise and that those conversions require energy. “If we didn’t have that ADC in there, we would be at three to four orders of magnitude lower current consumption per operation [than non-IMC implementations],” said Vishal Sarin, founder, president, and CEO of Sagence. “But because we use precision ADCs, we lose an order of magnitude.”

Some applications can work with lower precision, however. “The [bits stored per cell] varies depending on the requirements of the network,” explained Terrill. “We can go up to eight. It’s typically less, because we determine what accuracy is required and then store it at that quantization.”

Another surprising aspect is that the multiplication is slightly stochastic. Sagence said it does as much as it can to eliminate systematic errors, but even if it does that perfectly, small random errors will remain. The stochastics effectively make the boundaries between values fuzzy, and at the top end where the sections are very narrow, some may effectively collapse.

The approach works, according to Sagence, because a practical version will have hundreds or even thousands of cells along each bit line. Given those large numbers, the plus-or-minus few-percent errors on so many cells will average out to sufficient accuracy. “You wouldn’t do these kinds of multiplications and additions if you were trying to keep someone’s bank account accurate,” Sarin said. “But for deep learning it’s a perfect fit.”
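
A quick Monte Carlo sketch shows why that averaging works. The ±3% per-cell error is an assumed figure, not a Sagence specification; the point is that the relative error of a long sum shrinks roughly as one over the square root of the number of cells.

```python
# Monte Carlo check of the averaging argument: give every cell an independent
# +/-3% multiplicative error (an assumed figure) and watch the dot-product
# error fall roughly as 1/sqrt(cells per bit line).
import numpy as np

rng = np.random.default_rng(0)
for n_cells in (16, 256, 4096):
    w = rng.uniform(0.1, 1.0, n_cells)                 # weights along one bit line
    x = rng.uniform(0.1, 1.0, n_cells)                 # word-line inputs
    exact = np.dot(w, x)
    eps = rng.normal(0.0, 0.03, size=(1000, n_cells))  # 1,000 noisy trials
    noisy = (w * (1 + eps)) @ x
    rms = np.sqrt(np.mean(((noisy - exact) / exact) ** 2))
    print(f"{n_cells:>5} cells: RMS relative error {rms:.3%}")

# The error shrinks from roughly 1% toward a few hundredths of a percent --
# tolerable for deep learning, unacceptable for a bank balance.
```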

Still, for any AI solution, tools are essential so that users aren’t burdened with complexity. Sagence’s compiler takes the logarithmic nature into account at design time, statically allocating weights to cells.

“For hardware-resource identification, everything is done at compile time, meaning you don’t have to do any runtime scheduling,” explained Suhas Nayak, senior director, product marketing at Sagence. “The analog quantizer does hardware-aware training, noise-aware training, and generates information for further calibration if and when needed during runtime.” The benefit of this static scheduling is fixed, predictable latency.

The company has multiple means of dealing with variation, drift, and aging. Calibration deals with manufacturing variations, but Sagence also monitors cells with an option to reload the weights if things drift too far. “There’s metrology circuitry that observes what’s happening over time on the flash cells, and we can reload them as required if they reach a point where we can’t mitigate it with the underlying circuitry,” Terrill noted.

Based on the architecture, this would seem to be an integer-only solution, which would box it out of attention-based networks, but the company has other plans. “We plan to implement attention in our Gen AI solutions using proprietary methods,” said Sarin. “It is very much part of our solution.”

A tip of the cap
EnCharge takes a completely different approach in three major areas — the type of memory cell, the number of bits stored per cell, and the means of sensing the result. The latter was the big breakthrough, since all the prior current-sensing schemes show great variability with manufacturing and environmental conditions. Before spinning out of Princeton University, the company found that capacitors, which store charge, can serve as a sensing mechanism with none of that dependency.

“What’s important about this capacitor is it has no temperature dependence,” explained Verma. “It has no material parameter dependencies. It’s perfectly linear. It depends only on the space between the wires. This scales to the most advanced nodes because they give you better geometry control.”

The company has determined that accumulation requires better precision than multiplication, and that’s what the capacitors provide. The array consists of SRAM cells that each store one bit of the weight. As EnCharge hasn’t yet formally launched its technology, there are many details that remain undisclosed. The gist is that the SRAM cell provides multiplication, and each result controls a switch that places charge on the capacitor.

The capacitor is physically placed above the SRAM cell, between two upper layers of metal, so it’s easy to build and consumes no additional real estate. The capacitors for a single MAC are all connected to a shared plate, which averages their charges, effectively providing addition. The result is an analog value and so requires an ADC to convert it to the digital domain. No DAC is necessary.
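
Based on that description, a simplified model of the charge-domain accumulation might look like the following. The 1-bit multiplication as an AND, the unit capacitance, and the 1V reference are all assumptions made for illustration; EnCharge’s actual circuit details remain undisclosed.

```python
# Simplified model of charge-domain accumulation: each 1-bit multiply
# switches a fixed charge onto a shared capacitor plate, and the plate
# settles to the average of all contributions (equal capacitors assumed).
import numpy as np

def charge_domain_mac(weight_bits, activation_bits, v_ref=1.0):
    products = weight_bits & activation_bits   # per-cell 1-bit multiplication
    charges = products * v_ref                 # charge switched onto each capacitor
    return charges.mean()                      # shared plate averages the charges

w = np.random.randint(0, 2, 64)                # one weight bit per SRAM cell
a = np.random.randint(0, 2, 64)
v_out = charge_domain_mac(w, a)                # analog voltage headed to the ADC
assert np.isclose(v_out * len(w), np.dot(w, a))  # rescaling recovers the dot product
```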

EnCharge’s array requires data reloading because the memory is volatile, and the array isn’t big enough for an entire model. “To minimize overhead, we use a virtualized architecture similar to virtual memory,” said Verma.

Although this sensing approach seems novel, the company says that it’s been proven in high-precision circuits such as ADCs. “Its reliability, its scalability, its accuracy and manufacturability have been proven through these other ultra-high-precision analog circuits,” noted Verma.

DRAM may join the game
In one final new idea, startup Neo Semiconductor has proposed an IMC play using its 3D DRAM. “We can perform a huge amount of the calculation in our 3D DRAM array without sending it to SRAM,” said Andy Hsu, CEO and co-founder of Neo.

Neo’s primary development focus is on stacked 3D DRAM using floating-body charge storage instead of a capacitor. As with EnCharge, each DRAM cell holds one weight bit. That suggests a multiplication technique similar to EnCharge’s, but Neo hasn’t yet disclosed how it handles multi-bit multiplication.

The sensing is different, however. The vertical bit line carries an analog current that’s measured and digitized. It also has an ADC but no DAC. The approach works directly for integer data simply by engaging the appropriate number of DRAM bits (typically eight for INT8). The company says a floating-point unit is necessary for attention-based networks, but it hasn’t yet disclosed how that will work.
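
One plausible reading of “engaging the appropriate number of DRAM bits” is bit-plane processing: each weight bit gets its own cell and its own analog bit-line sum, and the partial sums are recombined with binary weights after digitization. Neo hasn’t confirmed this is its scheme; the sketch below is purely illustrative, with binary inputs to keep it simple.

```python
# Hypothetical bit-plane scheme for INT8: one cell per weight bit, one
# analog bit-line sum per plane, binary weighting applied after the ADC.
# Not Neo's disclosed method -- an illustrative reconstruction only.
import numpy as np

def int8_matvec_by_bit_planes(weights, activations):
    acc = 0
    for b in range(8):                       # engage one bit plane at a time
        plane = (weights >> b) & 1           # the b-th bit of every stored weight
        partial = np.dot(plane, activations) # analog current sum, then digitized
        acc += int(partial) << b             # binary weighting in digital logic
    return acc

w = np.random.randint(0, 256, size=32)       # unsigned INT8 weights
x = np.random.randint(0, 2, size=32)         # binary word-line inputs
assert int8_matvec_by_bit_planes(w, x) == np.dot(w, x)
```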

Neo is targeting HBM in two stages. First it can replace the memory dice in the HBM stack with its 3D DRAM version, which could boost capacity by two orders of magnitude. The second phase replaces sense-amp circuits on the HBM base die with neural ones. “For AI, the bottom die is replaced with [one that can] perform activation functions,” said Hsu.

One possible concern for this technique is heat. In a DRAM, a typical bit line reflects the value of a single cell. With AI, you’re measuring the current of multiple cells. HBM is already thermally challenged, and increasing the current still further would seem questionable without mitigation.

Neo says that because its DRAM technology is different, its read current is about 10% of what’s typical for standard DRAM, so it starts with less current. It works eight bits at a time, which should keep the total below standard DRAM levels. Eight bits isn’t much, but the company divides the overall data into eight-bit groups and processes them sequentially. This approach could impact latency, but no numbers are yet available.

This proposal needs more proving out before it’s real. Neo’s primary focus today is its 3D DRAM. The AI play is a newer idea, disclosed just this year, and at this point it’s still conceptual.

Analog IMC lives another day
Analog IMC has been eagerly awaited for many years, and many have clearly decided it’s not yet ready for prime time. It remains a ripe topic for university research, and many observers unaware of the impending launches are convinced that big changes will be necessary to make it work.

Sagence’s technology attempts to follow the path already traveled while avoiding prior pitfalls. EnCharge AI has focused on capacitors for sensing accumulation. Neo’s DRAM ideas are too new to know whether they can be commercially successful. The next year should provide an opportunity to see whether either of the first two achieves traction. If not, it’s back to the research lab.

If either or both works, however, a new standard for low power will have been established for inference. “The energy savings of not having to move all that data and the parallelism that IMC promises will massively impact not just AI, but any computation that takes place on large arrays of data,” said Russ Klein, program director, high-level synthesis division at Siemens EDA.

We’ll also have hard data showing which of the theories regarding data-movement power vs. computing power is correct.

Related Reading
Higher Density, More Data Create New Bottlenecks In AI Chips
More options are available, but each comes with tradeoffs and adds to complexity.
Mass Customization For AI Inference
The number of approaches to process AI inferencing is widening to deal with unique applications and larger and more complex models.
HW and SW Architecture Approaches For Running AI Models
Custom hardware tailored to specific models can unlock performance gains and energy savings that generic hardware cannot achieve, but there are tradeoffs.


