Memory Issues For AI Edge Chips

In-memory computing becomes critical, but which memory and at what process node?


Several companies are developing or ramping up AI chips for systems on the network edge, but vendors face a variety of challenges around process nodes and memory choices that can vary greatly from one application to the next.

The network edge involves a class of products ranging from cars and drones to security cameras, smart speakers and even enterprise servers. All of these applications incorporate low-power chips running machine learning algorithms. While these chips have many of the same components as other digital chips, a key difference is that the bulk of the processing is done in or near the memory.

With that in mind, the makers of AI edge chips are evaluating different types of memory for their next devices. Each comes with its own set of challenges. In addition, the chips themselves must incorporate low-power architectures, despite the fact that in many cases they are using mature processes rather than the most advanced nodes.

AI chips — sometimes called deep-learning accelerators or processors — are optimized to handle various workloads in systems using machine learning. A subset of AI, machine learning utilizes a neural network to crunch data and identify patterns. It matches certain patterns and learns which of those attributes are important.

These chips are targeted at a whole spectrum of compute applications, but there are distinct differences among the designs. For example, chips developed for the cloud typically are based on advanced processes, and they are expensive to design and manufacture. Edge devices, meanwhile, include chips developed for the automotive market, as well as drones, security cameras, smartphones, smart doorbells and voice assistants, according to The Linley Group. In this broad segment, each application has different requirements. A smartphone chip, for instance, is radically different from one created for a doorbell.

For many edge products, the goal is to develop low-power devices with just enough compute power. “You can’t afford a 300-watt GPU here. Even a 30-watt GPU is too much for a lot of these applications,” said Linley Gwennap, principal analyst with The Linley Group. “But device makers still want to do something that’s sophisticated. That requires more AI capabilities than you can get from a microcontroller. You need something that’s pretty powerful, but won’t drain the battery or empty out your wallet, particularly if it’s a consumer application. So you have to look at some fairly radical new solutions.”

For one thing, most edge devices don’t use chips at advanced nodes, because those processes are too expensive. There are exceptions, of course. In addition, many AI edge chips handle the processing functions in or near the memory, which speeds up the system while using less power.

Vendors are taking various memory approaches, and are exploring new ones for future chips. Among them:

  • Use conventional memories like SRAM and others.
  • Use NOR flash for a newer technology called analog in-memory computing.
  • Utilize phase-change memory, MRAM, ReRAM and other next-generation memories, which are being explored for AI edge chips.

AI explodes
Machine learning has been around for decades. For years, though, systems didn’t have enough horsepower to run these algorithms.

Recently, machine learning has taken off, thanks to the advent of GPUs and other chips, as well as machine-generated algorithms.

“Machine learning started to get useful in the 1990s,” said Aki Fujimura, chief executive of D2S. “But that has changed in recent years with the advent of GPUs. GPUs enabled deep learning to happen because there is so much more compute power available.”

The goal for these and other devices is to process algorithms in a neural network, which calculates matrix products and sums. A data matrix is loaded into the network. Then, each element is multiplied by a predetermined weight, and the result is passed to the next layer of the network and multiplied by a new set of weights. After several such steps, the result is a conclusion about the data.
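The multiply-and-sum flow described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function names, the example weights and the ReLU step between layers are all assumptions added for the sketch.

```python
def dense_layer(inputs, weights):
    """Multiply an input vector by a weight matrix and sum each row.

    weights holds one list of weights per output neuron.
    """
    return [sum(x * w for x, w in zip(inputs, row)) for row in weights]


def relu(values):
    """Assumed nonlinearity between layers (not specified in the article)."""
    return [max(0.0, v) for v in values]


def forward(inputs, layers):
    """Pass the data through each layer, applying a new set of weights each time."""
    activations = inputs
    for weights in layers:
        activations = relu(dense_layer(activations, weights))
    return activations


# Hypothetical two-layer network: 2 inputs -> 2 hidden values -> 1 output.
layers = [
    [[1.0, -1.0], [0.5, 0.5]],
    [[1.0, 2.0]],
]
result = forward([2.0, 1.0], layers)
print(result)  # [4.0]
```

Nearly all of the work is in the multiply-and-sum inside dense_layer, which is why accelerators concentrate on that operation and on keeping the weights close to it.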

Machine learning has been deployed across a number of industries, and in the semiconductor industry, dozens of machine learning chip suppliers have emerged. Many of them are developing chips for the cloud. In systems, these chips are designed to accelerate web searches and language translation, among other applications. The market for these devices exceeded $3 billion in 2019, according to The Linley Group.

Dozens of AI edge chip vendors also have emerged, such as Ambient, BrainChip, GreenWaves, Flex Logix, Mythic, Syntiant and others. In total, 1.6 billion edge devices are expected to ship with deep-learning accelerators by 2024, according to the firm.

AI edge chips also run machine learning algorithms using 8-bit computations. “You want to process data where it’s being generated and used. There are some huge advantages here. When we started, it was about battery life. If you don’t open a connection to the Internet, and you can do the AI locally, you save a tremendous amount of power. Responsiveness is also important, as well as reliability — and ultimately, privacy,” said Kurt Busch, chief executive of Syntiant. “In deep learning, it’s also all about memory access. Your power as well as your performance bottlenecks are all about the memory. Second, it’s also parallel processing. In deep learning, I can do millions of multiplies and accumulates in parallel and effectively linearly scale with parallel processing.”
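As a rough sketch of what 8-bit computation means in practice, the example below quantizes floating-point inputs and weights to signed 8-bit integers, performs the multiply-accumulate in integer arithmetic, and scales the result back. The scale factors, values and function names are assumptions chosen for illustration. Real accelerators pick scales per layer or per channel.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers, clamping to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(q_inputs, q_weights):
    """Integer multiply-accumulate. The accumulator is wider than 8 bits."""
    acc = 0
    for x, w in zip(q_inputs, q_weights):
        acc += x * w
    return acc


# Hypothetical data and power-of-two scale factors.
inputs = [0.5, -0.25, 1.0]
weights = [0.1, 0.2, -0.3]
in_scale = 1 / 64
w_scale = 1 / 256

q_inputs = quantize(inputs, in_scale)    # [32, -16, 64]
q_weights = quantize(weights, w_scale)   # [26, 51, -77]

acc = int8_dot(q_inputs, q_weights)      # -4912
approx = acc * in_scale * w_scale        # close to the float result of -0.3
exact = sum(x * w for x, w in zip(inputs, weights))
print(acc, approx, exact)
```

Each product in the loop is independent of the others, which is what lets hardware run many multiply-accumulates in parallel.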

AI edge chips have different requirements. For example, smartphones incorporate leading-edge application processors. But that’s not the case for other edge products like doorbells, security cameras and speakers.

“For solutions that are targeted on the edge, there is an economics question involved. It has to be cost-sensitive. The whole intent is to have competitive costs, low power, and distribution of the compute to make it easier,” said Walter Ng, vice president of business development at UMC.

There are other considerations. Many AI edge chip vendors are shipping products at mature nodes, namely 40nm or so. This process is ideal on one level, because it’s inexpensive. But going forward, suppliers want more performance with low power. The next node is 28nm, which is also mature and inexpensive. Recently, foundry vendors have introduced various 22nm processes, which are extensions of 28nm.

22nm is slightly faster than 28nm, but the prices are higher. Most vendors won’t migrate to finFETs at 16nm/14nm, because it’s too expensive.

Moving to the next node isn’t a simple decision. “Many customers and their applications are at 40nm today,” Ng said. “As they look to their next node roadmap, are they going to be satisfied and get the best bang for the buck on 28nm? Or will 22nm look more attractive and offer more benefits than 28nm? That’s a consideration many are looking at.”

Using traditional memory
There are other considerations, too. In traditional systems, the memory hierarchy is straightforward. SRAM is integrated into the processor as cache, which holds frequently used data and instructions. DRAM, used for main memory, is separate and located in a memory module.

In most systems, data moves between the memory and a processor. But this exchange causes latency and increased power consumption, which is sometimes referred to as the memory wall, and it can be increasingly problematic as the amount of data rises.

That’s where in- or near-memory computing fits in. In-memory brings the processing tasks inside the memory, while near-memory brings the memory close to the logic.

Not all chips use in-memory computing. AI edge chip vendors, however, are using these approaches to help break down the memory wall. They are also offloading some of the processing functions from the cloud.

Last year, for example, Syntiant introduced its first product, a “Neural Decision Processor” that incorporates a neural network architecture in a tiny, low-power chip. The 40nm audio device also incorporates an Arm Cortex-M0 processor with 112KB of RAM.

Fig. 1: Syntiant’s NDP100 Audio Neural Decision Processor. Source: Syntiant

Using SRAM-based memory, Syntiant classifies its architecture as near-memory computing. The idea behind the chip is to make voice the main interface in systems. Amazon’s Alexa is one example of an always-on voice interface.

“Voice is the natural next-generation interface,” Syntiant’s Busch said. “We purpose-built these solutions to add an always-on voice interface to almost any battery-powered device, from as small as a hearing aid to as large as a laptop or smart speaker.”

Going forward, Syntiant is developing new devices and is looking at different memory types. “We’re looking at some of the emerging memory technologies like MRAM and ReRAM, mostly for the density,” said Jeremy Holleman, chief scientist at Syntiant. “Read power and then idle power is also a big thing, because you wind up having a lot of memory for a large model. But then, maybe you’re only doing the compute on a fairly small subset at a given instance. The ability of a memory cell to lower its power when it’s not being used is pretty critical.”

Advanced processes aren’t required for now. “For the foreseeable future, the advance-node leakages are too high for ultra low-power applications,” Syntiant’s Busch said. “Devices at the edge often are doing nothing. They’re powered up and waiting for something to happen, as opposed to a device in a data center. You want that to be doing something all the time. Devices on the edge are often waiting for something to happen. So you need very low power consumption, and the advanced nodes are not good at that.”

There are some issues. Most AI chips today rely on in-built SRAM, which is fast. “But fitting millions of weights in a standalone digital edge processor using SRAM is very expensive, irrespective of the technology,” said Vineet Kumar Agrawal, design director of the IP Business unit at Cypress. “Getting data from a DRAM is 500X more expensive than getting it from an internal SRAM.”

Meanwhile, many AI edge chip vendors are using or looking at another memory type—NOR. NOR, a nonvolatile flash memory, is used in standalone and embedded applications. NOR is often used for code storage.

NOR is mature and robust, but it requires extra and expensive mask steps at each node. And it’s difficult to scale NOR beyond 28nm/22nm. Nevertheless, using today’s NOR flash, several companies are developing a technology called analog in-memory computing. Most of these devices start at the 40nm node.

“If you look at a traditional digital AI architecture, the two big sources of power consumption are going to be doing the computations—the multiply and add. And then the second thing is moving the data from the memory to the compute unit and back again,” explained Linley Group’s Gwennap. “What some are trying to do is solve both of those problems. They are putting the computation right into the memory circuitry so that you don’t have to move the data very far. Instead of using a traditional digital multiplier, they are using analog techniques where you can run the current through a variable resistance. And then you can use Ohm’s Law to calculate the product of the current and the resistance. And there’s your multiply.”
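The analog multiply can be modeled numerically. The sketch below assumes a common crossbar formulation, which is the dual of the current-times-resistance form in the quote: each cell's conductance encodes a weight, an input voltage is applied to each row, Ohm's Law gives the cell current (I = V × G), and the currents on a shared column wire add up to form the accumulate. The array size, voltages and conductance values are hypothetical.

```python
def crossbar_mac(voltages, conductances):
    """Return the summed current on each column of an idealized crossbar.

    voltages[row] is in volts; conductances[row][col] is in siemens.
    Ignores wire resistance, noise and device nonlinearity.
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for v, row in zip(voltages, conductances):
        for col, g in enumerate(row):
            # Ohm's Law per cell; currents sharing a column wire simply add.
            currents[col] += v * g
    return currents


# Hypothetical 2x2 array: two inputs, two outputs.
voltages = [0.5, 1.0]
conductances = [
    [2e-6, 4e-6],
    [1e-6, 3e-6],
]
column_currents = crossbar_mac(voltages, conductances)
print(column_currents)  # roughly 2 microamps and 5 microamps
```

In this idealized model the weights never leave the array. Only the input voltages and the summed column currents move, which is the source of the power savings described above.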

Analog-in-memory promises to reduce power. But not all NOR is alike. For example, some NOR technologies are based on a floating-gate architecture.

Using a NOR-based floating-gate approach, Microchip has developed an analog in-memory computing architecture for machine learning. The technology integrates a multiply-accumulate (MAC) processing engine.

“With this approach, a user doesn’t need to store model parameters or weights in an SRAM or external DRAM,” said Vipin Tiwari, director of embedded memory product development at Microchip’s SST unit. “Input data is provided to the array for MAC computation. That eliminates the memory bottleneck in MAC computation because computation is done where the weights are stored.”

There are other NOR options. For example, Cypress for some time has been offering a different embedded NOR flash technology called SONOS. Based on charge-trap flash, SONOS is a two-transistor technology. The threshold voltage can be changed by adding or removing an electric charge from the nitride layer. It’s available on various nodes down to 28nm.

SONOS can be optimized as an embedded memory for machine learning. “Two SONOS multi-bit embedded non-volatile memory cells can replace up to 8 SRAM cells, which is 48 transistors. There is an area efficiency here. You also get a 50X to 100X improvement both in power efficiency as well as throughput,” Cypress’ Agrawal said. “SONOS is programmed using a highly linear and low power tunneling process capable of targeting Vts with high control, resulting in a nanoamp bit-cell current level. This is opposed to floating gate, which uses hot electrons, where you don’t have control over how much current is going into the cell. Plus, your cell current is much higher.”

Using new memories
But with NOR unable to scale beyond 28nm/22nm, AI edge chip vendors are looking at several next-generation memory types, such as phase-change memory (PCM), STT-MRAM, ReRAM and others.

For AI, these memories can also be used to run machine learning applications built on neural networks.

Fig. 2: Analog Compute-In-Memory Accelerators For ML Using New Memories. Source: Imec

These memory types are attractive because they combine the speed of SRAM and the non-volatility of flash with unlimited endurance. But the new memories have taken longer to develop because they use complex materials and switching schemes to store data.

“Semiconductor manufacturers are faced with new challenges when migrating from charge-based memory (SRAM, NOR) to resistive memory (ReRAM, PCM),” said Masami Aoki, Asia regional director for process control solutions at KLA. “These emerging memories are composed of new elements and require precise control of material properties and new defect control strategies to ensure performance uniformity and reliability, especially for large scale integration.”

For some time, though, Intel has been shipping 3D XPoint, which is a PCM. Micron also is shipping PCM. A nonvolatile memory, PCM stores data by changing the state of a material. It’s faster than flash with better endurance.

PCM is a challenging technology to develop, although vendors have addressed the issues. “With 3D XPoint, which is phase-change memory, the chalcogenides are notoriously sensitive to ambient conditions and process chemistry,” said Rick Gottscho, executive vice president and CTO at Lam Research. “There are a variety of technical strategies to deal with all of those things.”

PCM is also being targeted for AI. In 2018, IBM presented a paper on an 8-bit precision in-memory multiplication technology using PCM. IBM and others continue to work on PCM for AI edge apps, although no one is shipping products in volumes.

STT-MRAM also is shipping. It features the speed of SRAM and the non-volatility of flash with unlimited endurance. It uses the magnetism of electron spin to provide non-volatile properties in chips.

STT-MRAM is ideal for embedded applications, where it’s targeted to replace NOR at 22nm and beyond. “If you look at the new memories, MRAM is the best for low-density, something less than one gigabit. MRAM is the best embedded memory. It’s better than NOR even if you could do NOR at that generation like 28nm or larger. NOR adds a 12+ masks so MRAM is the preferred option for embedded based on cost, density and performance,” said Mark Webb, principal at MKW Ventures Consulting.

MRAM, however, can support only two levels, so it is not suitable for in-memory computing, according to some experts. Others have a different viewpoint. “An MRAM device indeed will only store a single bit,” said Diederik Verkest, distinguished member of the technical staff at Imec. “In the context of in-memory compute, however, it is important to understand there is a difference between the memory device and the compute-cell. The compute cell executes the multiplication of the stored weight and the input activation. In the most optimal case, the memory device inside the compute cell can store multiple levels of weight. However, it is possible to make compute cells where the weight is stored using multiple memory devices. If 3-level weights are used (a weight can then be -1, 0, 1), then two memory devices can be used and the compute cell will consist of the two memory devices and some analog circuitry around that to calculate the product of the weight value and activation. So MRAM devices can be used inside compute cells to store multi-level weights and build compute-in-memory solutions.”
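One way to picture a compute cell that builds a 3-level weight from two single-bit devices is a differential encoding, sketched below. The encoding table and function names are assumptions for illustration. The quote says only that two memory devices plus some analog circuitry are used, not how the levels are mapped.

```python
# Assumed mapping of a ternary weight onto two one-bit devices (plus, minus).
ENCODE = {-1: (0, 1), 0: (0, 0), 1: (1, 0)}


def encode_weight(weight):
    """Return the states of the two binary devices that store a ternary weight."""
    return ENCODE[weight]


def compute_cell(devices, activation):
    """Model the cell output: activation times (plus device minus the minus device)."""
    plus, minus = devices
    return activation * (plus - minus)


def ternary_dot(weights, activations):
    """Sum the outputs of one compute cell per weight."""
    return sum(
        compute_cell(encode_weight(w), a) for w, a in zip(weights, activations)
    )


total = ternary_dot([1, 0, -1, 1], [3, 5, 2, 4])
print(total)  # 5
```

Each device still stores a single bit. The multi-level behavior comes from combining the two devices inside the cell.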

ReRAM is another option. This technology has lower read latencies and faster write performance than flash. In ReRAM, a voltage is applied to a material stack, creating a change in the resistance that records data in the memory.

At the recent IEDM conference, Leti presented a paper on the development of an integrated spiking neural network (SNN) chip using both analog and ReRAM technologies. The 130nm test chip had a 3.6pJ energy dissipation per spike. A device using 28nm FD-SOI is in R&D.

SNNs are different from traditional neural nets. “It doesn’t use any power until the input changes,” Linley Group’s Gwennap said. “So, in theory, it’s ideal if you have a security camera and it’s looking at your front yard. Nothing changes until somebody walks by.”
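The event-driven idea can be illustrated with a toy model that counts how often work would be triggered. This is only a sketch of the principle, not Leti's design or a real spiking neuron model. The frames, the threshold parameter and the function name are hypothetical.

```python
def count_events(frames, threshold=0):
    """Count per-pixel change events between consecutive frames.

    In an event-driven system, work is done only for these events.
    A conventional network would process every pixel of every frame.
    """
    events = 0
    previous = frames[0]
    for frame in frames[1:]:
        for old, new in zip(previous, frame):
            if abs(new - old) > threshold:
                events += 1
        previous = frame
    return events


# Hypothetical 4-pixel camera, four frames each.
static_scene = [[5, 5, 5, 5]] * 4
moving_scene = [
    [5, 5, 5, 5],
    [5, 9, 5, 5],  # something enters at pixel 1
    [5, 5, 9, 5],  # it moves to pixel 2
    [5, 5, 5, 5],  # it leaves
]

dense_updates = 4 * 3  # every pixel, every new frame
print(count_events(static_scene), count_events(moving_scene), dense_updates)
# 0 4 12
```

With a static scene there are no events and, in principle, no compute, which matches the front-yard camera example in the quote.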

Leti’s SNN device is ideal for the edge. “It remains to be seen what is exactly meant by the edge, but I can state that ReRAM and SNNs are especially tailored to endpoint devices,” said Alexandre Valentian, research engineer at Leti. “ReRAM and spike coding are a good fit, because this coding strategy simplifies in-memory compute. There is no need for a DAC at the input (as in matrix-vector multiplication) and it simplifies the ADC at the output (lower number of bits) or eventually removes it altogether if the neurons are analog.”

ReRAM, however, is difficult to develop. Only a few have shipped parts. “ReRAM was shown by us and others to be great in theory for 1T1R designs (embedded) and for 1TnR in the future with proper crosspoint selectors. The challenge is in the slow development of actual products in the last two years. We believe this is due to issues with retention and disturbs versus cycling in the storage element itself. These need to be resolved and we need actual products at 64Mbit embedded and 1Gbit crosspoint,” MKW’s Webb said.

All told, there is no consensus on which next-generation memory type is suitable for AI edge apps. The industry continues to explore current and future options.

For example, Imec recently evaluated several options to enable a 10,000 TOPS/W matrix-vector multiplier using an analog in-memory computing architecture called AiMC.

Imec evaluated three options — SOT-MRAM; IGZO DRAM; and projection PCM. Spin-orbit torque MRAM (SOT-MRAM) is a next-generation MRAM. Indium-gallium-zinc-oxide (IGZO) is a novel crystal morphology.

“Several device options can be used to store the weights of the DNN. The devices listed use different mechanisms to store the weight values (magnetic, resistive, capacitive) and lead to different implementations of the AiMC array,” Imec’s Verkest said.

It’s still unclear which current or next-generation memory technology will be the winner. Perhaps there is a place for all technologies. SRAM, NOR and other conventional memories also have a place.

Clearly, though, it’s unlikely there is room for dozens of AI chip vendors. There are already signs of a shakeout, where larger companies are beginning to buy the startups. As with any new chip segment, some will succeed, some will be acquired, and others will fail.


Harry says:

What is your take on the “non-filamentary interface switching ReRAM” being developed by 4DS, Western Digital and Imec?

Mark LaPedus says:

Here’s what Mark Webb, principal at MKW Ventures Consulting, says about this:

1) “4DS continues to develop technology with steady progress and publishes results openly to investors and the public. 4DS has proposed their technology as a solution with >1M cycles endurance and DRAM-like speed.”

2) “4DS, in our opinion, is behind filamentary companies (i.e., Crossbar) on the timeline to production in terms of wafers produced and process and product learning. But no ReRAM technology appears to be running at advanced nodes in production as of today.”

3) “While filament switching has limits and challenges today, it is not obvious from the data shown so far that interface switching has clearly better performance. We would look to see more published data in journals and conferences on >1Mbit arrays with cycling vs. retention results.”

4) “Our Emerging Memory Product Lifecycle compares timelines of all memory technologies and shows that filamentary solutions are at least 2 years from volume production and 4DS is 4-5 years from volume production. If a large semiconductor company prioritized ramping the technology, they could pull in these timelines by 50%.”

Karthikeyan Srinivasan says:

Nice article, but I have a question:

Will the tools used to make these chips themselves be embedded with AI chips, so that basic machine learning and predictive analysis on the tools’ logs is done from the moment a tool starts running product?
