Computing Where Data Resides

Computational storage approaches promise to ease the power and latency tradeoffs of moving data.


Computational storage is starting to gain traction as system architects come to grips with the rising performance, energy and latency impacts of moving large amounts of data between processors and hierarchical memory and storage.

According to IDC, the global datasphere will grow from 45 zettabytes in 2019 to 175 zettabytes by 2025. But that data is essentially useless unless it is analyzed or some amount of compute is applied to it, and moving that data to the CPU takes more energy than the compute itself. Approaches such as computational storage attempt to mitigate those issues.

Numerous comparisons have been made between the value of data and oil. “Both data and oil are reasonably useless unless you do something with them,” said Kartik Srinivasan, director of marketing for Xilinx’s data center group. “Raw oil cannot be put in your car. You need to have some process applied to it to make it useful. Data is even more challenging, because the latency with which you make the data available after your analysis is done is extremely critical. When a stock broker gets a piece of information saying, ‘This is the trade that the engine is recommending,’ if it comes five minutes late, it’s useless. So the value of data diminishes if the analysis on it is not being done with the right latency.”

All types of data sharing and streaming have been driving up the amount of data that is generated, exchanged, and shared. The pandemic has only exacerbated the demand for higher data center capacities, network speeds, storage capacities, and performance, observed Scott Durrant, DesignWare IP cloud segment marketing manager at Synopsys. “With COVID, that is being driven up exponentially, and it’s expected that the changes in the way we work, the way we learn, the way we interact, the way we find entertainment, are permanent.”

As a result, SoC designers creating devices meant for these high-speed, low-latency applications are beginning to consider alternative architectures, such as computational storage.

“As we see more control systems come online, that criticality of low latency is going to increase in importance,” Durrant said. “Along with this, another point of focus for data centers is optimizing energy utilization. There’s a big drive right now toward net zero carbon footprint for data centers, which is actually a huge challenge, because they are large consumers of power today. Every element in the data center is going to come into play around that. The data center is shifting toward what traditionally has been a mobile device architecture when it comes to optimizing for power. Mobile devices for years have been trying to maximize battery life by minimizing power consumption and shutting down pieces of the device that aren’t in use at a given point in time. We’re seeing similar implementations in the data center today in order to maximize power efficiency. Also, new processor architectures have been introduced that historically have been used in mobile devices. Arm processors, for example, are now targeting the data center infrastructure. And Arm and RISC-V processors are sufficiently open that you can optimize them for a particular workload.”

Fig. 1: Computational storage impact on energy. Source: Synopsys/21st IEEE Conference on HPCC, Aug. 2019

This is where computational storage fits in. “The idea of doing compute over data has been going on since time started,” Srinivasan said. “But now with the digital transformation, there is reasonably priced hardware to do the analysis, and the software framework that allows the analysis to be done. Those components are coming together in a good way. The data is digitally ready. The hardware is affordable. The software framework is available to apply.”

This is harder than it looks, however. “There will be disks and disks full of data, and you may only need one or two pieces of it, but all of it has to be searched,” said Steven Woo, fellow and distinguished inventor at Rambus. “The conventional way to do this is to take everything in the disk, transfer it to a CPU, and then the CPU is going to search through everything and throw away 99.999% of all that. A lot of the work it’s doing is actually wasted. Alternatively, there may be an array of disks, and the system is set up to transfer all the data in parallel, so it comes faster. But in the end, there’s still just one CPU searching for the data, which is the bottleneck.”

This is where computational storage really shines, Woo said. “What if each of the disks had a little bit of smarts in them? Now it’s possible to say to all of the disks, ‘Search in parallel. I’m going to say go, and each of you is going to scan all of your information. Only send me back what matches my particular request.’ What’s interesting there is I don’t waste bandwidth and energy moving data that I’m never going to use. It all stays local, and I only get to see the stuff that matches my criteria.”
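Woo's point about wasted bandwidth can be sketched in a few lines. The simulation below is purely illustrative (the drives, record names, and match predicate are all hypothetical, not a real storage API): it counts how many records cross the "interface" when the host scans everything versus when each drive filters locally and returns only matches.

```python
# Illustrative simulation (hypothetical data, not a real storage API) of
# host-side search vs. in-storage search across several drives.
records = [[f"drive{d}-rec{i}" for i in range(1000)] for d in range(4)]

def match(rec):
    """The one-in-a-thousand record the host actually wants."""
    return rec.endswith("-rec7")

# Conventional path: every record crosses the bus to the CPU,
# which then discards 99.9% of what it received.
transferred_host = sum(len(drive) for drive in records)
hits_host = [r for drive in records for r in drive if match(r)]

# Computational storage path: each drive scans its own data in parallel
# and only the matches cross the interface.
hits_csd = [r for drive in records for r in filter(match, drive)]  # filter runs "on the drive"
transferred_csd = len(hits_csd)

print(transferred_host, transferred_csd)  # 4000 records moved vs. 4
```

The results are identical either way; only the number of records moved changes, which is exactly the bandwidth and energy Woo describes as wasted.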

Broad applications
Computational storage is beginning to gain traction in the edge, a somewhat murky hierarchy of compute resources that spans from the end device to various types of servers, both on-premise and off-premise. The goal in all cases is to do as much computing as close to the source as possible, limiting the distance that data needs to travel.

“Today what happens is that data get transported, even if it’s generated at the edge, whether it’s surveillance images or license plate readers,” said Neil Werdmuller, director of storage solutions at Arm. “Typically, all of that data stream is sent to a central server somewhere, and then it’s processed. What it’ll pull out from the data stream is the occasional license plate number for example, and that just seems crazy. If you can just do the processing at the edge, send the insight or the value that you’ve generated where it needs to go, that makes much more sense.”

Fig. 2: Computational vs. traditional storage. Source: Arm

Another example could be in a 5G telecoms setting with computational storage implemented at the cell tower. “If a vehicle was coming through that particular cell area, they could have high-definition mapping, with the tile that’s needed stored where it’s needed, instead of every vehicle that comes through that cell area having to download the same thing from somewhere central,” Werdmuller said. “All of that backhaul is expensive, takes energy, and adds latency, because often when you’re moving data in this way there’s latency involved.”

There also are security and privacy benefits to processing that data at the edge.

But not all data can be collected and processed at the edge, and in the cloud there is a huge opportunity to utilize computational storage, as well.

“Managing all of that is really challenging, and often the number of workloads is exploding,” Werdmuller said. “Having all of those managed on a server, and then having to pull the data into there is complex. It adds power, it adds latency, so if you can have storage that’s got particular data stored on it, you can have the particular workload you’re going to use on that data. For instance, if you want to do machine learning on a whole range of photographs that do facial recognition, if you know where those images are stored on these drives, then doing that on the actual data makes a lot of sense. You have to manage fewer workloads centrally and are able to distribute workloads.”

All of this is becoming interesting to companies because of the energy and time required to move data, and the rising volume of data that needs to be processed quickly. “If it didn’t cost much energy and time, this wouldn’t be a problem,” Woo said. “Once you have this ability to only send back the information that really matters, maybe you can even do some simple compute in the storage, as well. But once you have that ability to send back a smaller amount of much more meaningful data to the CPU, then the CPU is going to try and hold on to it as long as it can. It can do machine learning techniques, where it holds the data and tries not to move it. What you’re hoping in all of this is to minimize the movement of data at the disk. Then, once you get something more meaningful, you can send that back to the processor so there’s no wasted bandwidth, because you’re only sending the meaningful stuff. Then the processor will hold on to it as long as it possibly can and try not to move it around. All of that is designed to minimize data movement.”

SSD to CSD evolution
From a design perspective, the path from solid state drive (SSD) to computational storage device (CSD) is an evolutionary one.

“Up until about 2019, we were solving one of the biggest problems with emulation that users were trying to work with in SSDs, i.e., measuring performance — specifically the IOPS (input/output operations per second) — and the bandwidth and the latencies,” said Ben Whitehead, solutions product manager, emulation division at Siemens EDA. “The latencies were a big problem with other kinds of verification because they weren’t very accurate. You could get a functionally correct solution, and it would be, ‘Wow, that’s great,’ and then you’d get to tape out and get in the lab, and sometimes its performance would be off by orders of magnitude. It was embarrassing how badly it would perform. It worked, but we weren’t able to really accurately measure the performance pre-silicon. That’s how difficult it is just to do an SSD. But an interesting by-product of all that SSD work was that SSDs started having multiple processors in them — real-time processors. I worked on designs with eight fairly large microprocessors in the design, and in a digital system, storage delivers its data really quickly and then it kind of goes idle. Most of the time those processors sit idle, and when they’re needed, they’re really needed. Like a fighter jet pilot, it’s four hours of boredom followed by 15 seconds of terror. That’s what drives do. They sit around waiting to just blast data back and forth. With all that processing power around it, it made logical sense to do something more with it.”

CSD addresses the movement of data. “The L2 caches are just bulldozed constantly by data requests for the CPU on the motherboard,” Whitehead said. “When the processor is spending all that time moving data back and forth, it raises the question, ‘What are we doing here,’ and the realization that a lot of that processing could be done somewhere else. That’s where CSD really gets its traction: we can use some of that processing power on the drive itself.”

Others agree. “Ten years ago solid state drives were new,” said Kurt Shuler, vice president of marketing at Arteris IP. “There really wasn’t anything like an enterprise SSD. There were little microcontrollers running on platter-type hard drives. That was where semiconductors were then. Since that time, so much has changed. A lot of startups were doing really sophisticated SSD controllers, and the problem initially was that NAND flash consumes itself while it’s operating, so you always have to check the cells. Then, once you find out they’re bad, you must rope them off and tell them not to save anything there anymore. If you buy a 1-terabyte SSD drive, it actually has more than 1 terabyte because it’s grinding itself to death as it operates. For the SSD controllers, that was the initial challenge. But now, storage disk companies have undergone a lot of consolidation. If you look at what’s going on computational storage, we have customers who are doing SSD storage and controllers for the data center that are focused on a particular application, such as video surveillance, so there is computation actually within those controllers that is dealing with that particular use case. That is completely new. Within that computation, you’ll see things like traditional algorithmic, if/then analysis. Then, some of it is trained AI engines. Any of the SSD, enterprise SSD controllers are heading in that direction.”
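Shuler's point about NAND "consuming itself" and 1-terabyte drives shipping with more than 1 terabyte of raw flash can be sketched as a toy flash translation layer. All numbers and names below are hypothetical; real controllers track wear per block and do far more, but the remapping idea is the same: bad blocks are roped off and replaced from an over-provisioned spare pool, so the host-visible capacity never shrinks.

```python
# Toy model (all numbers hypothetical) of why a "1 TB" SSD carries extra NAND:
# the controller retires worn-out blocks and remaps them to spares.
ADVERTISED_BLOCKS = 1000   # capacity exposed to the host
SPARE_BLOCKS = 70          # over-provisioned pool (~7% extra raw flash)

class ToyFTL:
    def __init__(self):
        # Logical block -> physical block, initially the identity mapping.
        self.mapping = {lba: lba for lba in range(ADVERTISED_BLOCKS)}
        self.spares = list(range(ADVERTISED_BLOCKS, ADVERTISED_BLOCKS + SPARE_BLOCKS))
        self.retired = set()

    def retire(self, lba):
        """A background scrub found this block's cells bad: rope it off."""
        self.retired.add(self.mapping[lba])
        self.mapping[lba] = self.spares.pop()  # remap to a spare; host sees nothing

ftl = ToyFTL()
for bad_lba in (3, 42, 999):
    ftl.retire(bad_lba)

# Host-visible capacity is unchanged even though 3 physical blocks are dead.
print(len(ftl.mapping), len(ftl.spares), len(ftl.retired))  # 1000 67 3
```

Once the spare pool is exhausted, the drive can no longer hide failures, which is why wear management was the first big job of SSD controllers before computation was layered on top.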

This is beginning to reshape the competitive landscape, particularly as traditional storage companies such as Western Digital begin designing their own hardware accelerators.

Understanding computational storage
From a technical standpoint, the concept is relatively simple, said Marc Greenberg, group director of product marketing, IP group at Cadence. “Take a simple counter or accumulator function in software, x = x + 1. If x is something we’re going to use a lot, it would probably get stored in a cache or scratchpad on our CPU die anyway. But let’s say we’ve got a whole lot of these counters or don’t use some of them very often. Then, some of them may be stored out in external memory, let’s say DRAM. When it comes time to do our x = x + 1 operation, we’ve got to activate the page in memory, read x from DRAM, take x to a processor, add 1 to it, and then write it back to DRAM. Depending on how long this process takes, we might need to pre-charge the page in memory and then activate it again later when the write is ready. All of this takes energy, both moving the data between the DRAM to CPU to DRAM, as well as having the page in DRAM activated and using current.”

If the memory device had a simple logic element in it, a transaction could be sent like, “Add one to the contents of memory address x, and return the result,” Greenberg said. By doing this, it cuts the energy required to move the data across the interface in half, reduces the amount of time that page in memory has to be active, and offloads the CPU.
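Greenberg's counter example can be sketched with a hypothetical memory model that counts words crossing the DRAM-to-CPU interface. The class and method names below are illustrative only; the point is that a read-modify-write costs two transfers while an in-memory "add one to address x and return the result" costs one, which is where the halving of interface energy comes from.

```python
# Hypothetical model of Greenberg's x = x + 1 example, counting words
# that cross the DRAM <-> CPU interface in each approach.
class SmartMemory:
    def __init__(self):
        self.cells = {}
        self.transfers = 0   # words crossing the interface

    # Conventional path: the CPU reads x out, adds 1, writes it back.
    def read(self, addr):
        self.transfers += 1
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.transfers += 1
        self.cells[addr] = value

    # Computational path: the memory's own logic element does the add;
    # only the result crosses the interface.
    def add_in_place(self, addr, delta=1):
        self.transfers += 1
        self.cells[addr] = self.cells.get(addr, 0) + delta
        return self.cells[addr]

mem = SmartMemory()
mem.write(0x10, 0)
mem.transfers = 0

mem.write(0x10, mem.read(0x10) + 1)   # conventional: read + write
conventional = mem.transfers

mem.transfers = 0
mem.add_in_place(0x10)                # in-memory: one transaction
print(conventional, mem.transfers)    # 2 1
```

The model ignores page activation and pre-charge energy, which Greenberg notes also shrink because the page stays active for less time.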

From that simple example, though, it can branch into a million options. “Should it be generalized to any x = x + y? What if it overflows? Should it be able to subtract too? How about multiply? What about other basic ALU functions like compare, shift, Boolean functions? Eventually it becomes a secondary CPU in the memory — not necessarily a bad thing, but not being done on a general-purpose basis today. At least in the short term, it seems that special function processors in memory may be the way to go. There are companies out there doing certain directed AI functions in the memory, for example, by building custom memory devices with the AI math processing functions built in,” he said.

Issues to iron out
But to make this work, logic must be added inside the memory, said Darko Tomusilovic, verification lead at Vtool. “And you must add some controllers as part of the memory block, part of the memory logic. There is a difference from having a split between the processor and the memory. Now you will put a piece of logic inside the memory itself. And when you verify the memory controllers, you will have to take into account that it’s not just a stupid, dumb memory controller. It also can do stuff. This violates the whole concept of verification. Before, we had to test the memory subsystem, then integrate the memory subsystem into the full chip environment. It was more or less common methodology to run software test cases only as part of the full chip. Now it becomes much more interleaved. In that sense, we see that there is a huge demand for specific engineers who will only target memory controller verification, so now it almost becomes a separate profession. As a services company, we see a huge demand specifically for that.”

At the same time, there are missing standards on all levels for such approaches, noted Andy Heinig, head of the Efficient Electronics department at Fraunhofer IIS’s Engineering of Adaptive Systems Division. “Programmers want to program in frameworks, meaning high-level programming based on libraries (such as TensorFlow for AI programming). Software is also programmed at a lot of different levels of abstraction — low-level driver programming, first-level libraries, higher-level libraries based on the former ones, application programming. It may be possible to encapsulate the in-memory computing at the driver level, but we assume the full potential of the approach can’t be realized this way. Exploiting the full potential is only possible if you have access to the mechanism through all the programming levels, because then the application algorithm can use the accelerated data computation directly. But to get it through all the programming levels also means that standards on all programming levels are necessary to get compatibility between different libraries. To realize such programming at the framework level, a lot of lower levels must be realized. If it has to be done anew for each architecture, it is time- and resource-consuming. So standards on each level, such as hardware and the software driver level, must be established.”

Many CSD implementations utilize an applications processor inside the drive rather than a real-time processor. “Applications processors have a lot of different requirements and handling than a real-time processor,” noted Siemens EDA’s Whitehead. “They’re completely different. To account for this, some processor providers, like Arm, have created processors specifically for computational storage with multiple cores. That combines the elements of applications processors with real-time processors, and so processing can be done in either mode.”

Whitehead said most engineers with experience in real-time processors, SSDs and controllers are used to how this works. “With an Arm processor, you have firmware that runs on it, doing writes and reads and garbage collection and all the things that you want on an SSD. But now you’re adding a whole Linux stack inside your DUT, and that has some real ramifications in the design and in the verification, because now you have an actual full-blown system in your drive. It looks like a computer, and that means different things. It still has to look like an SSD from a host perspective when you plug it in. It’s still going to say, ‘I’m a drive, and I can do all your storage needs.’ But if the system is aware that it also has a Linux system, you can SSH [use a network protocol for securely communicating between computers] into that system, and then it looks just like a headless server running inside that drive.”

That brings up a completely different set of verification standards, considering that latency and performance are the biggest struggle.

“Now you’ve added an applications processor in your drive,” he said. “How do you measure that? What happens if it’s working and you need to reboot into those real-time modes? Now you’ve really messed up your latency and you’re going to get really skewed numbers. You need to account for that.”

He’s not alone in pointing this out. “Fundamentally, if you go to the Linux ideal, there is no difference apart from the amount of compute,” said Arm’s Werdmuller. “If you look at an Arm smart NIC, for example, or an Arm-based server or an Intel Xeon server, because it’s got so many workloads and it’s managing so many things, typically the compute is very intensive. For some workloads that can be the right approach. Of course, in other cases, actually doing distributed compute — where each drive does little bits instead of shifting it all to a really powerful compute that takes huge amounts of power — there are benefits to doing it locally. There is a power limit, typically up to 25 watts in a PCIe slot, but 20-plus watts of that is used to power the NAND and the RAM in the device, so you’re left with up to 5 watts for everything else in the compute. It seems low, but you can still do a lot with that level of compute.”

With more compute moving into memory, other parts of the compute architecture are changing, as well.

“Our old definition of ‘server’ used to be that there is a CPU sitting in it, and it’s responsible for everything that happens,” said Xilinx’s Srinivasan. “All application processing, all the data is being done by the CPU, and the rest of the peripherals are pretty much just responsible for getting the data into the CPU, storing it, or letting the scratchpad memory access the data for whatever intermediate compute is needed.”

Now there is a sharing of responsibility as the industry is slowly beginning to evolve, he said.
“From the hyper-scalers down to the enterprise, they’re all embracing the fact that not all workloads are created equal. You want to be able to use the CPU for the right job, GPUs for the right job, and FPGAs for what they’re capable of. That means using the right tool for the right job, which is not an easy task because there’s no such thing as a typical data center. For this reason, computational storage concepts will be applied at the drive, processor or array levels.”


Kevin Cameron says:

In-memory computing goes back to the 1970s –

The Transputer was an implementation in the 1980s.

What has stopped it from happening is the separation of CPU and memory because of DRAM being on a different silicon process. However, now that we have die-stacking, that’s less of a problem.

Current computational storage is more like distributed computing since the PCIe cards are essentially computers in their own right.

Ann Mutschler says:

Thanks for your comment, Kevin. Good point. While obviously not a new idea by any stretch, it’s yet another way that compute can be customized, and I think that’s the interesting bit! There are so many ways to approach the system architecture today — sorting through the tradeoffs of each requires architects and designers with a lot of knowledge, insight and experience.

Anon says:

Hi Ann,
Which HPCC paper is the first figure from?
