In-Memory Vs. Near-Memory Computing

New approaches are competing for attention as scaling benefits diminish.

New memory-centric chip technologies are emerging that promise to solve the bandwidth bottleneck issues in today’s systems.

The idea behind these technologies is to bring the memory closer to the processing tasks to speed up the system. This concept isn’t new, and previous versions of the technology fell short. Moreover, it’s unclear if the new approaches will live up to their billing.

Memory-centric is a broad term with different definitions, although the latest buzz revolves around two technologies—in-memory computing and near-memory computing. Near-memory incorporates memory and logic in an advanced IC package, while in-memory brings the processing tasks near or inside the memory. Both technologies are valid and can be used for different apps.

Both in- and near-memory are aimed at boosting the data processing functions in today’s systems, or driving new architectures such as neural networks. With both approaches, a processor handles the processing functions, while memory and storage store the data.

In systems, data moves between the memory and a processor. But this exchange adds latency and consumes power, a problem sometimes referred to as the memory wall.

The industry is working on solutions. “Everybody is striving for a chip that has 100 TeraOPS of performance,” said Steve Pawlowski, vice president of advanced computing solutions at Micron Technology. “But to get the efficiency of that chip, you must have several things going on simultaneously. This means having to bring data into the chip and get it out of the chip as fast as possible.”
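A rough back-of-the-envelope sketch shows why data movement matters for a target like that. It assumes a deliberately simple model in which every byte fetched from memory feeds a fixed number of operations. The function name and the ops-per-byte values are illustrative assumptions and do not come from Micron.

```python
def required_bandwidth_gbps(peak_ops_per_s: float, ops_per_byte: float) -> float:
    """Bandwidth in GB/s needed so that memory does not stall the compute units."""
    return peak_ops_per_s / ops_per_byte / 1e9


PEAK_OPS = 100e12  # the 100 TeraOPS target quoted above

# Hypothetical arithmetic intensities: operations performed per byte fetched.
for intensity in (1, 10, 100, 1000):
    bandwidth = required_bandwidth_gbps(PEAK_OPS, intensity)
    print(f"{intensity:>5} ops/byte -> {bandwidth:>10,.0f} GB/s")
```

Under this model, a workload that performs 100 operations per byte would need about 1,000GBps of sustained bandwidth to keep a 100 TeraOPS chip busy. The less reuse each byte gets, the steeper the requirement.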

That’s where in- or near-memory computing fits, bringing memory closer to the processing tasks, or integrating it into them, to boost the system. Both technologies are attractive for another reason: they may give the industry another option besides traditional chip scaling.

In scaling, the idea is to make devices smaller with more functions at each node. But chip scaling is becoming harder and more expensive at each turn, especially for logic devices and DRAM.

In some cases, though, these memory-centric architectures perform different tasks with chips that don’t always require advanced nodes. Both in- and near-memory computing will not replace chip scaling, but they do provide other options.

What is in-memory computing?
In today’s systems, the traditional memory/storage hierarchy is straightforward. SRAM is integrated into the processor as cache, which gives fast access to frequently used data and instructions. DRAM, which is used for main memory, is separate and located in a dual in-line memory module (DIMM). Disk drives and NAND-based solid-state drives (SSDs) are used for storage.


Fig. 1: Memory/storage hierarchy. Source: Lam Research

Systems built on this hierarchy face an explosion of data on the network. For example, IP traffic is expected to reach 396 exabytes (EB) per month by 2022, up from 122 EB per month in 2017, according to Cisco.
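The annual growth rate implied by those two Cisco figures can be worked out with a few lines of arithmetic. This is only a sketch. It assumes smooth compound growth over the five years, which the forecast itself does not state.

```python
start_eb, end_eb = 122.0, 396.0  # EB per month in 2017 and 2022, per the figures above
years = 2022 - 2017

growth_factor = end_eb / start_eb
# Compound annual growth rate, assuming the same rate every year.
cagr = growth_factor ** (1 / years) - 1

print(f"total growth over {years} years: {growth_factor:.2f}x")
print(f"implied annual growth: {cagr:.1%}")
```

That works out to more than a threefold increase overall, or a little over 26% per year when compounded.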

The data growth rates are accelerating. “If you look at some of the drivers, you have mobile applications. There is a need for more data as you go into 5G networks. You have more video and higher screen resolutions,” said Scott Gatzemeier, vice president of R&D operations at Micron, at a panel during the recent IEDM conference. “Then, if you look at some of the AI applications on the phones with facial recognition and authentication, it’s driving not only larger memory but also the need for faster memory.”

The data explosion is having an impact on systems. “As the amount of data increases in our world from tens of terabytes to hundreds of terabytes inside a server, we are faced with a problem in moving the data back and forth from SSDs to the CPUs. It’s going to be an energy problem and we will run into several system bottlenecks,” said Manish Muthal, vice president of the Data Center unit at Xilinx, at the panel.

During the panel, Jung Hoon Lee, head of DRAM device and process integration at SK Hynix, summarized the problem: “The data is growing faster compared to the computing performance. There is a need for some middle layers to solve the problem.”

One way to solve this is to integrate processors, memory and other devices in a traditional von Neumann architecture. Scaling these devices will provide more performance, but this adds cost and complexity to the equation.

Another approach is to move toward these newfangled in- and near-memory architectures. “We are seeing trends around integrating new memory technologies,” said Yang Pan, corporate vice president of advanced technology development at Lam Research. “The growing trend of near-memory computing and in-memory computing will drive new architectures that are integrating logic (digital and analog) and new memories.”

So what exactly is in-memory computing? Today, there is no single definition or approach.

“You will get different answers about in-memory computing depending on who you ask,” said Gill Lee, managing director of memory technology at Applied Materials. “There are products coming out in that direction. In-memory computing is happening now using existing memory technology. The products are being built specifically for those applications. That will drive more segmentation in memory applications.”

The term “in-memory computing” isn’t new and can be used in various ways. Among them are:

  • The database world uses in-memory computing for caching and other apps.
  • Chipmakers are developing chip technologies to handle the processing tasks in memory for neural networks and other applications.
  • There are some newfangled approaches in the works, namely neuromorphic computing.

For years, Oracle, SAP and others have used in-memory computing in the database world. A database is stored and accessed in a computer. In a traditional database, the data is stored in a disk drive. But accessing the data from the drive can be a slow process. So database vendors have developed ways to process the data in the main memory in a server or subsystem, not in the disk drive. This, in turn, boosts the speeds of the transactions.

That’s a simple way of explaining a complicated topic. Nonetheless, in the database world, that’s called in-memory computing or in-memory database.

In the database world, the use of in-memory computing is based on the classical approach. “They still use the same von Neumann capability and programming model,” Micron’s Pawlowski said. “It’s trying to find the best way to co-locate the data to that process to make it faster.”

In the semiconductor/systems world, in-memory computing has the same basic principle with a different twist: you bring memory closer to, or inside, the processing functions in various systems. In the past, this technology was sometimes referred to as “processing in memory.” For years, vendors introduced various devices in the arena, but many of those efforts failed or fell short of their promises.

Recently, several companies have introduced new and improved versions of this technology. There are various approaches using DRAM, flash and the new memory types. Many of these are billed as in-memory computing. That’s not to be confused with in-memory in the database world.

Many of the new and so-called in-memory chip architectures are designed to drive neural networks. In neural networks, a system crunches data and identifies patterns. It matches certain patterns and learns which of those attributes are important.

Neural nets consist of multiple neurons and synapses. A neuron could consist of a memory cell with logic gates. The neurons are daisy-chained and connected with a link called a synapse.

Neural networks function by calculating matrix products and sums. A network consists of three types of layers: input, hidden, and output. In operation, a pattern is first written in a neuron in the input layer. The pattern is then broadcast to the other neurons in the hidden layers.


Fig. 2: DNNs are largely multiply-accumulate. Source: Mythic

Each neuron reacts to the data. Using a weighted system of connections, one neuron in the network reacts the strongest when it senses a matching pattern. The answer is revealed in the output layer.
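The flow from input layer to output layer can be sketched in a few lines of code. This is a toy illustration, not any vendor’s implementation. The layer sizes, weight values and the ReLU activation are all assumptions chosen to keep the arithmetic easy to follow.

```python
def mac_layer(inputs, weights, biases):
    """One fully connected layer: each output is a sum of input * weight products."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, neuron_weights):
            acc += x * w  # one multiply-accumulate (MAC) operation
        outputs.append(max(0.0, acc))  # ReLU activation (an assumed choice)
    return outputs


# Hypothetical toy network: 3 inputs -> 2 hidden neurons -> 2 outputs.
hidden_w = [[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, 0.0], [0.5, 0.5]]
output_b = [0.0, 0.0]

pattern = [1.0, 0.5, 2.0]                     # pattern written to the input layer
hidden = mac_layer(pattern, hidden_w, hidden_b)
scores = mac_layer(hidden, output_w, output_b)
best = scores.index(max(scores))              # the output that reacts the strongest

print(hidden, scores, best)
```

Nearly all of the work sits in the inner loop, which reads one weight and performs one multiply-accumulate. In a real network that loop runs millions of times, so the cost of fetching the weights tends to dominate.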

Neural nets are different than traditional systems. “If you are doing a pass through a neural network, you have maybe tens of megabytes or even hundreds of megabytes of weights that need to be accessed,” said Dave Fick, CTO of Mythic, an AI chipmaker. “But they are accessed basically once for each layer and then you have to discard that weight and get a different one for memory in the later stages of the network.”

In some systems, a neural network is based on a traditional chip architecture using GPUs. A GPU can handle multiple operations, but it needs “to access registers or shared memory to read and store the intermediate calculation results,” according to Google. This may impact the power consumption of the system.

There are different ways of performing these tasks. For example, startup Mythic recently introduced a matrix multiply memory architecture. It performs the computations inside the memory using a 40nm embedded NOR flash technology.

This is different than traditional computing using processors and memory. “If you build a processor with hundreds of megabytes of SRAM, you can fit your entire application on there. But you still have to read the SRAM and get that data to the correct processing elements,” said Mythic’s Fick. “We’re avoiding that by doing the processing directly within the memory array itself. The goal is to minimize that data movement as much as possible. We have an aggressive approach where we are not going to move the data at all, let alone moving it from DRAM to on chip. We are also not going to worry about moving data out of the memory in the first place.”

Typically, NOR stores data in an array of memory cells. Mythic uses a NOR bit cell, but it replaces the digital peripheral circuitry with analog. “Our approach is to do analog computing inside the array. Our array has digital interfaces,” he said. “Mythic does this in a 40nm process, while these other systems are in much newer process nodes. While other system designers are struggling to get from 7nm to 5nm, Mythic will be scaling to 28nm.”

By definition, Mythic is handling the computing tasks in-memory. There are other novel ways to perform in-memory computing tasks, as well. Some have all-analog approaches, while others are developing SRAM- and capacitor-based technologies. All technologies are in various stages of development.
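The general idea of computing inside an array with digital interfaces can be modeled in software. The sketch below is a simplified behavioral model under stated assumptions, not a description of Mythic’s circuit. The converter resolutions, value ranges and function names are hypothetical.

```python
def quantize(value, lo, hi, bits):
    """Model a data converter: clamp to [lo, hi], then round to one of 2**bits levels."""
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    step = (hi - lo) / levels
    return lo + round((clamped - lo) / step) * step


def in_memory_matvec(cells, inputs, in_bits=8, out_bits=8, out_range=4.0):
    """cells[r][c] is the weight stored at row r, column c; it never leaves the array."""
    # Digital inputs are converted into analog drive levels on the rows (DAC).
    drive = [quantize(x, 0.0, 1.0, in_bits) for x in inputs]
    results = []
    for c in range(len(cells[0])):
        # Every row contributes to the column at once, standing in for
        # analog summation on a shared line.
        column_sum = sum(drive[r] * cells[r][c] for r in range(len(cells)))
        # The column result is digitized on the way out (ADC).
        results.append(quantize(column_sum, -out_range, out_range, out_bits))
    return results


# Hypothetical 3x2 weight array and input vector.
cells = [[0.5, -1.0], [1.0, 0.25], [-0.5, 2.0]]
print(in_memory_matvec(cells, [1.0, 0.5, 0.25]))
```

The weights stay where they are stored, and only the inputs and results cross the array boundary. The price in this model is precision, because each converter rounds to a limited number of levels.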

The industry also has been working on a non-traditional approach called neuromorphic computing. Some call it compute in memory, which is still several years away from being realized.

Compute in memory also uses a neural network. The difference is the industry is attempting to replicate the brain in silicon. The goal is to mimic the way that information is moving from one group of neurons to another using precisely-timed pulses.

“This is where you’re building a compute structure in the memory or storage process technology. You tend to co-locate the compute function inside there,” Micron’s Pawlowski said. “For example, we can read a line of memory, and then put it in a smaller DRAM structure and have a good cache that has extremely low latencies.”

For this, the industry is looking at several next-generation memory technologies, such as FeFETs, MRAM, phase-change and RRAM. All of these are attractive because they promise to combine the speed of SRAM and the non-volatility of flash with unlimited endurance. The new memories have taken longer to develop, though, as they use exotic materials and switching schemes to store information.

Nevertheless, neuromorphic computing is a different paradigm with numerous challenges.

“In neuromorphic, pulses can come in at any particular time. You can quantize them in certain ways, but they are asynchronous types of computes. It’s really these pulses that come in from the various axons. They don’t come in the same clock boundaries,” Pawlowski said. “The other question is how do you make it easy enough for the programmer to use it and not make it so difficult. A lot of the work we are doing is where we find a usage model in the software framework to start making this transition of reducing the power and the energy, as well as increasing the performance of moving processing closer to, and eventually, inside a memory array.”
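The asynchronous behavior Pawlowski describes can be illustrated with a leaky integrate-and-fire neuron, a common textbook model. The model choice and every parameter value below are assumptions for illustration. They do not come from Micron.

```python
import math


def lif_spike_times(input_times, weight=0.6, tau=10.0, threshold=1.0):
    """Event-driven leaky integrate-and-fire neuron.

    State is updated only when an input pulse arrives, at whatever time it
    arrives. There is no global clock tick.
    """
    potential, last_t, fired = 0.0, 0.0, []
    for t in sorted(input_times):
        potential *= math.exp(-(t - last_t) / tau)  # charge leaks away between pulses
        potential += weight                          # integrate the incoming pulse
        last_t = t
        if potential >= threshold:
            fired.append(t)                          # the neuron emits its own pulse
            potential = 0.0                          # and resets
    return fired


# Two pulses close together push the neuron over threshold.
print(lif_spike_times([1.0, 2.0, 50.0, 51.5]))
# The same number of widely spaced pulses leak away without firing.
print(lif_spike_times([1.0, 30.0, 60.0]))
```

Timing carries the information here. Whether the neuron fires depends on how close together the pulses arrive, which is one reason the programming model differs so much from clocked logic.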

Still to be seen, meanwhile, is which is the best memory type for the task. “I don’t know which type of memory will win, but it’s going to be memory driven. We must solve the power density problem. We are right in the beginning of also changing the programming model to leverage this. The memories are going to be hierarchical. It’s going to be multi-level and distributed,” said Renu Raman, vice president and chief architect for cloud architecture and engineering at SAP, at the recent IEDM panel.

What is near-memory?
Besides in-memory technology, it’s also possible to incorporate the memory and logic chips in an advanced IC package, such as 2.5D/3D and fan-out.

Some refer to this as near-memory computing. Like in-memory, the idea is to bring the memory and logic closer in the system.

“The world is driving more data in systems. So, processors need large amounts of memory. And the memory and processor need to be very close,” said Rich Rice, senior vice president of business development at ASE. “So, you need packaging solutions that enable it, whether it’s 2.5D or a fan-out with a substrate approach. This could also be PoP structures like package-on-package.”

In 2.5D, dies are stacked on top of an interposer, which incorporates through-silicon vias (TSVs). The interposer acts as the bridge between the chips and a board, which provides more I/Os and bandwidth.

For example, a vendor could incorporate an FPGA and high-bandwidth memory (HBM). HBM stacks DRAM dies on top of each other, enabling more I/Os. For example, Samsung’s latest HBM2 technology consists of eight 8Gbit DRAM dies, which are stacked and connected using 5,000 TSVs. This enables 307GBps of data bandwidth. In a traditional DDR4 DRAM, the maximum bandwidth is 85.2GBps.

The next HBM version is called HBM3, which enables 512GBps of bandwidth. It would have a density of 128Gbit, compared to 64Gbit for HBM2.
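The HBM figures above can be roughly reproduced with simple arithmetic. The sketch assumes a 1,024-bit interface per stack and a 2.4Gbps per-pin data rate for HBM2. Neither number appears in this article, so treat both as assumptions.

```python
def stack_capacity_gbit(dies: int, gbit_per_die: int) -> int:
    """Total capacity of a stack of identical DRAM dies, in Gbit."""
    return dies * gbit_per_die


def bandwidth_gbytes_per_s(bus_width_bits: int, gbit_per_s_per_pin: float) -> float:
    """Peak bandwidth in GB/s: pins times per-pin rate, divided by 8 bits per byte."""
    return bus_width_bits * gbit_per_s_per_pin / 8


HBM_BUS_BITS = 1024    # assumption: interface width per stack
HBM2_PIN_GBPS = 2.4    # assumption: per-pin data rate for the HBM2 part cited above

hbm2_capacity = stack_capacity_gbit(8, 8)             # eight 8Gbit dies
hbm2_bandwidth = bandwidth_gbytes_per_s(HBM_BUS_BITS, HBM2_PIN_GBPS)
hbm3_pin_rate = 512 * 8 / HBM_BUS_BITS                # pin rate needed for 512GBps

print(hbm2_capacity, round(hbm2_bandwidth, 1), hbm3_pin_rate)
```

With those assumptions, the stack holds 64Gbit and peaks at about 307GBps, in line with the figures quoted. Reaching 512GBps over the same interface width would take 4Gbps per pin.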

Besides 2.5D, the industry is working on 3D-ICs. In 3D-ICs, the idea is to stack memory dies on a logic chip, or logic dies on each other. The dies are connected using an active interposer.

“2.5D enables an order of magnitude increase in interconnect density. What you are trying to address is memory bandwidth and latency,” explained David McCann, vice president of post fab development and operations at GlobalFoundries.

3D-ICs enable more bandwidth. “Instead of interconnecting on the edges of chips, you are using the whole ‘X’ by ‘Y’ surface area,” McCann said.

In addition, the industry is working on a version of high-density fan-out with HBM. “It is intended to be an alternative to an interposer solution for these markets. It provides a lower-cost solution, and actually has better electrical and thermal performance than a silicon interposer structure,” said John Hunt, senior director of engineering at ASE.

Clearly, there is a lot of activity, if not confusion, for both in- and near-memory technology. It’s unclear which technologies will prevail. The dust has yet to settle in this arena.
