Solving The Memory Bottleneck

Moving large amounts of data around a system is no longer the path to success. It is too slow and consumes too much power. It is time to flip the equation.


Chipmakers are scrambling to solve the bottleneck between processor and memory, and they are turning out new designs based on different architectures at a rate no one would have anticipated even several months ago.

At issue is how to boost performance in systems, particularly those at the edge, where huge amounts of data need to be processed locally or regionally. The traditional approach has been to add more compute capability into a chip and bring more memory on-chip. But that approach no longer scales, so engineers have begun focusing on solving the bottleneck between processors and memories.

The rise in awareness of the migration of compute to memory has been quite dramatic. Until very recently, it was considered a topic suitable for research and was generally dismissed by systems companies. Even at the Design Automation Conference (June 2019) there were many naysayers. But by Hot Chips (September 2019) several companies had released products, and several of those already have signed up their first round of customers.

There are many variants among the released products and those still waiting in the wings to be announced. “Generally, the notion is that you have to stop moving data as much and do the processing in situ — where the data lives,” says Ben Whitehead, storage specialist in the emulation division of Mentor, a Siemens Business. “This trend has really taken off. It has hit the knee, or the big bend, in the innovation and adoption curve. That curve starts with experimental and it spends a long time in that area. Just this year, it is coming off that knee and it is starting to grow at a much faster rate.”

Interestingly, the same question is being asked at several levels within the industry. “This is the same discussion we have when people look at edge processing,” says Frank Schirrmeister, senior group director of product management at Cadence. “The question in the networking domain is, ‘Where do I process the data and how much data do I transmit?’ It is a balance between the network and the communication fabric versus processing closer to the memory.”

Similar issues are being seen in the server market, with processing in storage. “We see it when looking at traditional applications and workloads,” says Scott Shadley, vice president of marketing for NGD Systems. “For example, there is a shift into where the database management layers are being placed. It is no longer just in-memory like Oracle or Hadoop. There is some muddying of that water.”

These changes have been brewing for some time. “It is all about the limitations of von Neumann at the architectural level, and about Moore’s Law and Dennard scaling, which have created issues with scaling efficiently,” says Sylvain Dubois, vice president of business development and strategic marketing for Crossbar. “It is about how much power and energy we have to consume to compute at these nodes. This is a great opportunity for new architectures.”

Could this just be short-term hype? “It is hard to overstate the legs this movement has,” says Mentor’s Whitehead. “It will change the industry a lot. There are still a lot of questions to be answered, but there are products in the market today. The numbers that are being seen in terms of benchmarks are enormous.”

Quiet start
The beginnings of this shift tended to fall under the radar. “The GPU is a kind of solution to this, but it does not really solve it,” says Crossbar’s Dubois. “It is just giving a little bit of extra room because it is highly parallelized. It is still based on the same memory bottleneck. People have realized that approaches such as CNN accelerators or the Google TPU are new architectures. Companies are now investing vertically all the way from the processors and the memory integrations to the semiconductor business. It is great news for the semiconductor industry.”

It is also seen by many as a necessary change in direction. “Today, people are talking about having an engine optimized to do AI, and then discussing if the engine sits on a separate chip or within the memory,” says Gideon Intrater, chief technology officer for Adesto. “Bandwidth is improved when you are on-chip, but there are some solutions today that go beyond that — solutions that are doing the computations really within the memory array or very close to it by utilizing analog functions.”

Intrater points to a number of different possibilities. “Rather than using hundreds of hardware multipliers to do matrix operations, you could potentially take each 8-bit value, run it through a D2A, and do the computation in an analog fashion, where you just use Kirchhoff’s Law to do the multiply. It is not as exact or accurate as doing it in a digital manner, but for most cases it is good enough. And by doing that, vendors claim significantly faster operation and lower power. Even that is not the leading edge. The very leading edge is to store the bits in memory as analog values, use the resistance of the non-volatile memory (NVM) as the stored weight value, and then drive current through it to do the multiplication. So there are at least two steps beyond doing the operations in a digital way that appear to be quite promising. These are really in-memory processing rather than near-memory processing.”
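
To make the analog idea concrete, here is a minimal numerical sketch in Python of the scheme Intrater outlines: weights held as conductances, inputs applied through a DAC, products summed as current on a column, and the result digitized by an ADC. The function name, bit widths and unit ranges are illustrative assumptions rather than any vendor's design, and real arrays would also add noise, drift and IR drop.

```python
import numpy as np

def analog_mac(weights, inputs, dac_bits=8, adc_bits=8):
    """Model one crossbar column: weights stored as conductances,
    inputs driven as DAC voltages, products summed as current per
    Kirchhoff's current law, then digitized by a finite-resolution ADC."""
    # Quantize inputs (assumed normalized to [0, 1]) to the DAC resolution.
    levels = 2 ** dac_bits - 1
    v = np.round(np.clip(inputs, 0, 1) * levels) / levels
    # Weights (also normalized) map directly onto a conductance range.
    g = np.clip(weights, 0, 1)
    i_column = np.sum(v * g)                  # analog current summation
    # Digitize the column current against its worst-case full scale.
    full_scale = len(weights)
    adc_levels = 2 ** adc_bits - 1
    code = np.round(i_column / full_scale * adc_levels)
    return code / adc_levels * full_scale

rng = np.random.default_rng(0)
w, x = rng.random(256), rng.random(256)
print("analog estimate:", analog_mac(w, x))   # close, but not bit-exact
print("digital result: ", float(np.dot(w, x)))
```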

Many of these problems have been worked on for some time, but they have remained somewhat hidden. “Wear leveling and garbage collection and everything else that happens in a solid-state drive (SSD) is more complex than most people give it credit for,” adds Whitehead. “Many of these devices have a dozen or more processors in them. As more and more compute has been stuffed into SSDs, and not all of it used all of the time, what if we happen to make some of that processing power available? It is not a stretch that they start to use that in a different mode, or add an applications processor and start running Linux.”
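
As a rough illustration of what “making that processing power available” could look like, the sketch below contrasts pulling every record to the host with pushing a filter down to the drive. The record size, the predicate and both function names are hypothetical; this is not NGD’s or any other vendor’s API.

```python
RECORD_SIZE = 4096  # assumed bytes per record

def host_side_filter(records, predicate):
    """Conventional path: every record crosses the storage interface,
    then the host filters it."""
    bytes_moved = len(records) * RECORD_SIZE
    return [r for r in records if predicate(r)], bytes_moved

def in_storage_filter(records, predicate):
    """Pushdown path: the drive's spare cores run the predicate and
    only matching records cross the interface."""
    hits = [r for r in records if predicate(r)]
    return hits, len(hits) * RECORD_SIZE

records = [{"id": i, "temp": i % 100} for i in range(100_000)]
is_hot = lambda r: r["temp"] > 95

_, moved_host = host_side_filter(records, is_hot)
_, moved_drive = in_storage_filter(records, is_hot)
print(f"host-side filter moves {moved_host / 1e6:.0f} MB, "
      f"in-storage filter moves {moved_drive / 1e6:.1f} MB")
```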

These approaches add new possibilities without disrupting existing compute paradigms. “The world of flash storage taking over from spinning media, and the end of the notion that there is always a latency involved because of rotation, has helped a lot with the hardware guys being able to deliver on software needs with new architectures,” asserts NGD’s Shadley. “Flash has enabled a lot of things, including in-memory and in-storage processing that could never have been done if we had stuck with hard drives.”

In fact, much of this is not new. But it does have to be updated in the context of new computing demands in AI/ML systems and at the edge. “You are talking about the equivalent of a microcontroller,” points out Intrater. “They exist today with way more memory in terms of die size than the processor consumes. You could call that processing within memory. Combining a processor and all of the memory required for an application, and putting them together on the same chip, has been around since the predecessors of the 8051. Clearly there is a disadvantage to this solution, which is that you can only process as much data as can fit into the on-chip SRAM. Perhaps you just have to build it to fit the application.”
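
The on-chip SRAM constraint Intrater mentions is easy to express as a back-of-the-envelope check. The sketch below, with made-up numbers, simply asks whether an application’s working set fits the memory that can be built next to the processor.

```python
def fits_on_chip(weight_bytes, activation_bytes, sram_bytes=512 * 1024):
    """Rough feasibility check for 'build it to fit the application':
    does the whole working set live in on-chip SRAM? All figures here
    are illustrative, not tied to any specific device."""
    working_set = weight_bytes + activation_bytes
    return working_set <= sram_bytes, working_set

# Example: a small keyword-spotting model with 8-bit weights.
ok, ws = fits_on_chip(weight_bytes=300_000, activation_bytes=64_000)
print(f"working set {ws / 1024:.0f} KB ->",
      "fits on-chip" if ok else "needs off-chip DRAM")
```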

As with many advancements today, artificial intelligence (AI) and machine learning (ML) are spearheading the adoption of new technology. They are not bound by the legacy that holds other areas back. “ML algorithms require relatively simple and identical computations on massive amounts of data,” says Pranav Ashar, chief technology officer for Real Intent. “In/near-memory processing would make sense in this application domain to maximize performance and power metrics.”

Along with the computation, other operations can be optimized. “These are engines designed to do a specific task, such as the matrix operations required for AI, and they come together with special DMA engines that just bring in the necessary information,” says Adesto’s Intrater. “These matrices are huge but have significant parts of the array that are 0. You don’t want to multiply those, so you typically have smart DMA engines that just bring in the non-zero values. There are many optimizations being considered for processors that are specifically designed for AI.”
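
The zero-skipping behavior Intrater attributes to smart DMA engines can be sketched in a few lines: keep only the non-zero weights (as coordinate/value triples here) and multiply with just the values that were actually fetched. The compression format and the 5% density are illustrative choices, not a description of any particular engine.

```python
import numpy as np

def compress_nonzeros(matrix):
    """Mimic a 'smart DMA' that only transfers non-zero weights:
    keep (row, col, value) triples instead of the dense array."""
    rows, cols = np.nonzero(matrix)
    return list(zip(rows.tolist(), cols.tolist(), matrix[rows, cols].tolist()))

def sparse_matvec(triples, x, n_rows):
    """Multiply using only the values that were actually fetched."""
    y = np.zeros(n_rows)
    for r, c, v in triples:
        y[r] += v * x[c]
    return y

rng = np.random.default_rng(0)
W = rng.random((512, 512)) * (rng.random((512, 512)) < 0.05)  # ~95% zeros
x = rng.random(512)

triples = compress_nonzeros(W)
print(f"fetched {len(triples)} of {W.size} weights "
      f"({100 * len(triples) / W.size:.1f}%)")
assert np.allclose(sparse_matvec(triples, x, 512), W @ x)
```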

But there is only so far that technology such as convolutional neural networks can go, says Crossbar’s Dubois. “CNNs are important but they are not solving the main problem, which is data access. People are realizing that with computing and AI, it is all about how efficiently you can access data and bring it back to compute. This is a favorable trend because everyone realized that data access is the most important thing to solve if we want to become more energy efficient or to put AI to the edge.”

Heterogeneous thinking
To make use of processing in or near memory does require some changes, however. “On the highest level, you have to think about how you are going to spread and parallelize this job,” says Gilles Hamou, CEO for UPMEM. “You will have to think about data locality and parallelism — data locality because you have to associate data with a processor, and you have to understand how to parallelize the application. There is efficiency in sharing the work as opposed to organizing it. This is not like a GPU, which uses a SIMD approach. With that you have to not only parallelize but also homogenize your computation.”
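
Here is a minimal sketch of the two steps Hamou describes, in Python with threads standing in for memory-attached processors: first decide which data lives with which processor, then run the same kernel on each local partition and reduce the partial results on the host. The bank count and the word-count kernel are placeholders for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

N_BANKS = 8  # hypothetical number of memory-attached processors

def split_by_locality(data, n_banks=N_BANKS):
    """Data locality: decide which records live with which processor.
    Here it is a simple contiguous partition."""
    chunk = (len(data) + n_banks - 1) // n_banks
    return [data[i * chunk:(i + 1) * chunk] for i in range(n_banks)]

def local_kernel(partition):
    """Parallelism: each processor works only on its own bank's data
    (a word count stands in for the real kernel)."""
    return sum(len(line.split()) for line in partition)

def run(data):
    parts = split_by_locality(data)
    with ThreadPoolExecutor(max_workers=N_BANKS) as pool:
        partials = list(pool.map(local_kernel, parts))
    return sum(partials)   # host-side reduction of the per-bank results

lines = ["the quick brown fox"] * 10_000
print(run(lines))
```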

As systems become more heterogeneous, additional problems need to be addressed. “A lot of the stuff is going on asynchronously, plus it is hard to overstate the heterogeneity,” says Whitehead. “I don’t see that changing. It is clusters of compute and little clusters of storage, and there will be software to manage it. But it is not homogeneous. It is no longer just an addressable memory space. When you distribute computation to all of the nodes, the latency for getting the answer will be the highest latency of all of them. Previously, if the device was not doing garbage collection, it would give you an answer very quickly, but now the latency becomes a significant issue.”
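
Whitehead’s latency point can be seen with a toy simulation: if a request fans out to many nodes, it finishes only when the slowest one answers, so a rare garbage-collection stall on any node dominates the tail. All of the timing numbers below are invented for illustration.

```python
import random

def node_latency_us(gc_probability=0.02):
    """Toy per-node latency: fast reads, with an occasional long stall
    while the node is busy with garbage collection."""
    if random.random() < gc_probability:
        return random.uniform(3000, 8000)   # stalled behind garbage collection
    return random.uniform(80, 120)          # normal read

def fan_out_latency_us(n_nodes):
    """A distributed request completes only when the slowest node answers."""
    return max(node_latency_us() for _ in range(n_nodes))

random.seed(1)
for n in (1, 16, 256):
    trials = sorted(fan_out_latency_us(n) for _ in range(1000))
    print(f"{n:4d} nodes: median {trials[500]:7.0f} us, p99 {trials[990]:7.0f} us")
```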

That requires a rethinking of this entire process. “People are realizing that there are new and innovative ways to do things that don’t cost them, but it does require a willingness to make changes and to accept that heterogeneous approaches can work where only homogeneous ones were used in the past,” points out NGD’s Shadley. “Anytime new technologies come out, you find the guys who say, ‘I know how this works, so I am not sure I trust this new stuff until you prove it to me.’ Even that is starting to go away, allowing more heterogeneous types of architecture to become viable. You will always have the companies that own the homogeneous piece of the puzzle and don’t want it to change, because their market has to change. But they are now realizing that they can allow it to adapt, and they have to adapt with it.”

Many products have failed because they forgot to consider the software. “Hardware guys look at the chip and they think about the possibilities of how they could use it,” says Amr El-Ashmawi, vice president of Cypress’ Memory Product Division. “The software team says, ‘This is how I want to do things.’ That creates a conflict. Product companies that open up their embedded processors sometimes forget that they have to go to the software team with a complete ecosystem — toolkits, SDKs, drivers, a whole bunch of stuff. It is a different ball game.”

New verification challenges
Many of these new architectures require new verification techniques to go along with them. “With data locality, you have the question of coherency,” says Schirrmeister. “If you have different processing elements you have to figure out if they have something to talk to each other about through the memory. Then cache coherency becomes very important. When someone accesses the memory, they all have to decide who has the latest version of that element. In-memory processing adds an even more interesting facet to that because the processing on that memory comes into play as well.”
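
The coherence facet Schirrmeister mentions can be shown with a deliberately simplified model: a host cache holds a copy of a line while a processing-in-memory engine updates the same location behind it. Unless the in-memory write triggers an invalidation, or some equivalent coherence action, the host keeps reading stale data. The classes below are a toy sketch, not a real protocol.

```python
class Memory:
    def __init__(self):
        self.cells = {0x100: 1}

class HostCache:
    """Toy cache with an explicit invalidate hook."""
    def __init__(self, mem):
        self.mem, self.lines = mem, {}
    def read(self, addr):
        if addr not in self.lines:                 # miss: fill from memory
            self.lines[addr] = self.mem.cells[addr]
        return self.lines[addr]
    def invalidate(self, addr):
        self.lines.pop(addr, None)

class PimUnit:
    """Processing-in-memory engine that updates data behind the cache."""
    def __init__(self, mem):
        self.mem = mem
    def increment(self, addr, cache=None):
        self.mem.cells[addr] += 1
        if cache is not None:                      # coherence action
            cache.invalidate(addr)

mem = Memory()
cache = HostCache(mem)
pim = PimUnit(mem)

print(cache.read(0x100))           # 1, and the line is now cached
pim.increment(0x100)               # PIM writes without telling the cache
print(cache.read(0x100))           # still 1: stale data
pim.increment(0x100, cache=cache)  # same write, but with invalidation
print(cache.read(0x100))           # 3: coherent again
```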

That is not an insurmountable problem. “Some of the memory is shared, some is tightly attached to each of the cores,” says Dubois. “We do have many more cores in the system and an increasing number of hardware accelerators that have some dedicated memory, and some of the data has to be shared between the cores. So it does add one more level of complexity, but it is not a revolution. Designers are used to handling many-core systems in their chip developments. That is just an evolution.”

Still, some demands are new. “We have to deliver solutions that enable them to measure performance and latency associated with these drives within 5% of silicon,” says Whitehead. “These are the types of problems that we need to be able to answer with our verification tools. We can see where the industry is going because we have to understand why certain things are so important to them. They know what they need, and they are very demanding of us to provide the tools that they need.”

One issue is getting the vectors to exercise them. “Looking at memory interfaces from a performance perspective becomes more important,” adds Schirrmeister. “The application-level performance analysis becomes even more important. How do you generate the tests for this? The only way many bugs manifest themselves is by somebody saying that this operation should be faster than it is. And then you have to analyze and debug – not at the signal level, not even at the transaction level, but more at the topology level – to figure out why the other processes are halting and the dependencies that exist around the system that indicate that the task was not parallelized properly or the pipeline was not properly defined.”

It is the temporal aspects of the system that are becoming the most important. “Different processing can occur in a temporally random fashion and, as such, verification solutions that depend on cycle-by-cycle stimulus and monitoring will not be effective,” says Dave Kelf, vice president and chief marketing officer for Breker Verification Systems. “Verification that relies on an overall intent specification that can handle unexpected concurrent activity in a range of fashions will be required. This will drive more test synthesis methods that can produce these test vector forms.”

Or perhaps we have to look at the problem differently. “Asynchronous interfaces are not an afterthought or an overlay on the core computation. They are ingrained into the core computation,” asserts Real Intent’s Ashar. “A bottom-up verification flow that validates metastability hardness and data integrity across said asynchronous interfaces could be deemed mandatory in such design paradigms.”

Conclusion
While most of the technical aspects of compute in or near memory are not new, their adoption will force significant change throughout the industry. The trifecta of von Neumann architectures, Moore’s Law and Dennard scaling is now collectively forcing the industry to change, and that in turn will impact the way that applications have to think about the hardware platform they run on.

Many new devices are entering the market today, and most of them provide modest gains to existing software. But to get the maximum benefit, re-architecting the software will be required.

The charge will be led by systems companies that control both the hardware and the software, and the rest of the industry will be allowed to make use of the new capabilities whenever it can. However, those who wait too long may soon become uncompetitive in the market.

Related Stories
Will In-Memory Processing Work?
Changes that sidestep von Neumann architecture could be key to low-power ML hardware.
Using Memory Differently To Boost Speed
Getting data in and out of memory faster is adding some unexpected challenges.
In-Memory Computing Challenges Come Into Focus
Researchers digging into ways around the von Neumann bottleneck.
Machine Learning Inferencing At The Edge
How designing ML chips differs from other types of processors.



4 comments

peter j connell says:

As a newb, I find it very odd that I had pondered similarly on the possibilities of the quite powerful processors on nvme controllers.
It occurred to me that each nvme is potentially able to perform independent processing on its locally stored data.

If AI etc. is going to have to sift through petabytes of metadata, it will be the lag, limited bandwidth and power usage of moving that data between processors that will be most taxing. Doing ~some processing in situ w/ storage would be a boon.

Andy Walker says:

Great article

BillM says:

Remote data processing has many upsides concerning power consumed (just in transmitting a full, unprocessed data stream), time to transmit, and even a security aspect (processed vs. raw data provides additional security). The one downside is incorporating more processing capability at the fringe, but integration is a well-honed process.

I expect distributed processing to be standard in all products in the near future.

Kevin Cameron says:

You can tweak the SMP architecture to make it do in-memory computing – solves most of the problems.

The other problems are fixed by dropping RTL and going asynchronous along with photonics for communication.

At this point the only missing piece is micro-power tunable lasers.
