New Power, Performance Options At The Edge

Tradeoffs grow as chipmakers leverage a variety of architectural choices.


Increasing compute intelligence at the edge is forcing chip architects to rethink how computing gets partitioned and prioritized, and what kinds of processing elements and memory configurations work best for a particular application.

Sending raw data to the cloud for processing is both time- and resource-intensive, and it’s often unnecessary because most of the data collected by a growing number of sensors is useless. Screening that data closer to the source is much more energy-efficient, but even there efficiency can be improved. So now chip architects are looking at a slew of other options, including what data gets processed first, where it is stored, and how close to various elements in the memory hierarchy that processing needs to happen.
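The payoff of screening at the source can be sketched with a toy filter. This is a hypothetical illustration, not any vendor's pipeline; the threshold and data are invented:

```python
# Hypothetical sketch: screen raw sensor samples at the edge and forward
# only readings that cross an anomaly threshold. The threshold and the
# synthetic data are illustrative assumptions.

def screen_at_edge(samples, threshold):
    """Return only the samples worth sending upstream."""
    return [s for s in samples if abs(s) >= threshold]

# 10,000 raw readings, most of them near-zero noise.
raw = [0.01 * (i % 7) for i in range(10_000)]
raw[42] = 5.0      # one genuine event
raw[9001] = -3.2   # another

to_cloud = screen_at_edge(raw, threshold=1.0)

print(f"raw samples:     {len(raw)}")
print(f"sent to cloud:   {len(to_cloud)}")
print(f"traffic reduced: {100 * (1 - len(to_cloud)/len(raw)):.2f}%")
```

When most sensor output is noise, as the article notes, nearly all upstream traffic disappears.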

While there is still a need for general-purpose processors, like a CPU or APU, specialized accelerators can deliver orders-of-magnitude improvements in performance and energy efficiency. But those compute elements increasingly have to be viewed in the context of a larger system. So rather than just building a faster processor or accelerator, or widening the path to memory, chipmakers are focusing on how data moves, the underlying architecture for moving that data, and the optimum balance of what gets processed, when it gets processed, and where that processing occurs.

One of the big knobs to turn involves how far data needs to travel between memory and various processing elements.

“We’ve done a lot of work over the years to ensure that data is kept on the CPU, and within the system where it’s important,” said Peter Greenhalgh, vice president of technology, and fellow at Arm. “This extends not just to our CPU elements, but through to our interconnect (Coherent Mesh Network) that we build — particularly for the infrastructure space where we can ensure that data is kept on chip rather than being moved off chip, where it can be more costly to access. This extends to the work that we’re doing at the moment for looking at 3D implementation. With 3D scaling, we’re extending our CMN to cope with 3D architectures so that you can have memory on a chip above, and then interface it in the most efficient way, but also lower power than going off to DDR.”

Moving processing directly into memory adds another option. Scott Durrant, strategic marketing manager at Synopsys, pointed to three ways to implement computational storage solutions — computational storage arrays, computational storage processors, and computational storage drives. Each comes with its own set of tradeoffs.

A computational storage array is an accelerator module with an integrated switch, which provides connectivity between the host CPU and traditional SSDs. “This approach is power-efficient because there is only one storage processor consuming power, and it is convenient and cost-efficient because it enables the use of conventional SSDs,” Durrant said. “While this approach delivers on computational storage’s benefit of reducing the amount of data that must be exchanged between the host CPU and storage, it does not reduce the data or associated power passed between the SSDs and the storage processor.”

A computational storage processor, in contrast, is connected to the PCIe bus as a peer to conventional SSDs.

“This approach eliminates the power and latency of the switch in the computational storage array, but requires more data movement over the processor’s PCIe bus and does not scale efficiently with the amount of storage in the system,” said Durrant. “The storage processor chip architect will need to consider a number of factors to decide where to place the compute function. For example, the computational storage drive approach may make sense if the storage processor architect has access to, or perhaps can integrate, the flash controller. This approach is also particularly interesting if the target application lends itself well to parallel processing. Other approaches may be preferable if the architect needs to deliver a product that leverages the economics of legacy flash drives and/or does not provide the inherent scalability of the computational storage drive approach.”

With a computational storage drive, the device is connected to the host CPU via NVMe. These drives have their own built-in storage processor, delivering processing power that scales with the number of drives. The advantage of this approach is that it minimizes storage latency by reducing the number of external components. It also minimizes data movement because it can filter and process data before it leaves the SSD.
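The data-movement savings of in-drive filtering can be put in rough numbers. The record size, record count, and match rate below are assumptions for a sketch, not measurements of any product:

```python
# Illustrative model (not a real NVMe API): compare bytes crossing the
# host interface when filtering happens on the host vs. inside the drive.
# Record size, record count, and match rate are assumed for the sketch.

RECORD_SIZE = 128          # bytes per record, assumed
NUM_RECORDS = 1_000_000
MATCH_RATE = 0.001         # fraction of records a query actually needs

# Host-side filtering: every record travels over the bus to the CPU.
host_filter_bytes = NUM_RECORDS * RECORD_SIZE

# In-drive filtering: only matching records leave the SSD.
drive_filter_bytes = int(NUM_RECORDS * MATCH_RATE) * RECORD_SIZE

print(f"host-side filter: {host_filter_bytes / 1e6:.1f} MB moved")
print(f"in-drive filter:  {drive_filter_bytes / 1e6:.3f} MB moved")
print(f"reduction:        {host_filter_bytes // drive_filter_bytes}x")
```

With a selective query, the traffic over the host interface shrinks in proportion to the match rate, which is the scaling benefit Durrant describes.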

Fig. 1: Computational storage options. Source: Synopsys

Partitioning momentum
There are other ways to minimize movement, as well. Breaking apart chips into separate dies or chiplets has been gaining steam ever since Apple opted for a fan-out in the iPhone 7. Since then, Intel and AMD have each created a chiplet model, and systems companies have embraced some form of advanced packaging. While the initial applications were largely a way of cramming more compute horsepower into a package, chipmakers are beginning to leverage advanced packaging to add more customization.

One of the initial attractions of advanced packaging was that it enabled various elements of a chip to work together, no matter what manufacturing process they were developed with. But as various elements are stacked up, this approach actually can shorten the distance that signals need to travel compared to routing a signal from one end of a large SoC to another. As a result, a signal can be driven using less power and with greater speed than over a long, thin wire.

“3D stacking and the high bandwidth that you can get interfacing between memory, or in fact different processors on different stacks, in a 3D die is going to be a super important piece of technology for delivering more and more performance with each process generation,” said Arm CEO Simon Segars. “People talk about the end of Moore’s Law ending the transistor scaling but that move into three dimensions is going to be one of the areas in which that helps performance just keep improving generation after generation. We’ve done a number of research projects. We’ve done partnerships with some of the leading foundries on how to make that easy to use because there’s a lot of physical implementation and manufacturing issues that come in when you get into 3D stacking. That’s going to be an exciting development, yet to come, as the architecture evolves.”

The key here is to figure out what works in multiple dimensions and what doesn’t. This is particularly important as computing intelligence moves closer to the edge, which remains a vaguely defined region between the end point and the cloud.

“Machine learning is a classic example,” said George Wall, director of product marketing for Tensilica Xtensa Processor IP at Cadence. “It started initially as doing object detection and classification for certain applications. And now, machine learning has branched out into doing things like processing radar inputs, processing voice commands, processing network traffic even. In order to meet the computational needs for machine learning-type algorithms, as well as the throughput required, usually the computation has to be moved out closer and closer to the data source.”

There is much engineering work yet to be done, both in software and in hardware, in order to make all of this work as expected.

“A machine learning engine that is targeted toward object detection, for example, would not be energy-efficient if it was trying to do something in the storage space,” said Wall. “You would need to have an optimized engine for that particular application that just naturally wants to sit closer to the data source. The further away you get from the data source, the more it impacts your latency because you have to transmit that data over to the engine. And then, as the engine makes decisions, it has to broadcast those decisions back to the source. There’s a round trip delay there. The more you can reduce that round trip delay, the more you can reduce the latency.”
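Wall's round-trip point can be put in rough numbers. The one-way latencies and compute time below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope round-trip latency for an inference request.
# All latency figures are assumptions for illustration.

ONE_WAY_LATENCY_S = {
    "on-die accelerator": 1e-6,   # ~1 us, assumed
    "edge gateway":       5e-3,   # ~5 ms over a local network, assumed
    "cloud region":       50e-3,  # ~50 ms over a WAN, assumed
}

COMPUTE_TIME_S = 2e-3  # time for the engine itself to decide, assumed

for where, one_way in ONE_WAY_LATENCY_S.items():
    # Data travels out to the engine, the decision travels back.
    total = 2 * one_way + COMPUTE_TIME_S
    print(f"{where:20s} round trip: {total * 1e3:6.2f} ms")
```

Even with identical compute time, the transport legs dominate once the engine moves off-premises, which is why the computation migrates toward the data source.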

Saving energy
Moving data on and off a chip is still a challenge for energy-efficient designs. With a graphics or machine learning chip, this is particularly evident because by nature these are data-intensive applications.

“Re-using data has been something we’ve been worrying about for a long time,” said Segars. “We have made some innovations over the years in our graphics architecture specifically to address that. Shoveling data on and off chip is very power-hungry. Moving it from main memory onto the chip is very costly from an energy point of view. We’ve done lots of things over the years to look to bring data into the chip, and use it as much as we can before we move on and replace it with other data. This is one of the reasons we increased the data size in SVE2 (scalable vector extension 2). The more data you can hold in your hand and process before you have to go and fetch the next set of data, the more energy efficient the system can be.”
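The reuse idea Segars describes can be sketched by counting off-chip fetches for a kernel that makes several passes over its data. The sizes and fetch granularity below are arbitrary assumptions:

```python
# Sketch: count off-chip fetches for a kernel making several passes over
# a vector. N, the fetch granularity, and pass count are assumed.

N = 1024                 # elements in the working set
ELEMS_PER_FETCH = 16     # elements brought on-chip per off-chip access
PASSES = 8               # times the kernel walks the data

# No reuse: every pass streams the whole vector from main memory again.
fetches_no_reuse = PASSES * (N // ELEMS_PER_FETCH)

# With reuse: fetch each chunk once, run all passes on it while it is
# held on-chip (e.g. in wide vector registers), then move on.
fetches_with_reuse = N // ELEMS_PER_FETCH

print(f"off-chip fetches without reuse: {fetches_no_reuse}")
print(f"off-chip fetches with reuse:    {fetches_with_reuse}")
```

Holding more data on-chip per fetch, as with wider vectors, divides the most expensive traffic by the number of passes.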

Compression and energy efficiency
Jem Davies, vice president and general manager of machine learning, and fellow at Arm, said this issue has been at the forefront for some time. “Two of the big data plane problems we work on are graphics and machine learning, both of which require very large datasets, so concentrating on the compute versus load store energy use is something we focused on from the very beginning,” Davies said. “The fact that we use compression in several different places in the GPUs, and the way we’ve moved on to do that with our neural network processors, is a reflection of that. We can compress it and decompress for less energy than it would take to load and store it uncompressed. It also affects our cache architecture, as well.”

Compression plays a key role in the energy efficiency equation. While energy is consumed in encoding and decoding, there are overall savings from sending compressed data. The good news is this is a relatively simple improvement that is outside the design of most IP.

“It really doesn’t impact what we’re doing inside that IP, but it does impact the configuration,” said Ashraf Takla, CEO of Mixel. “From a system perspective, do you use four lanes at high frequency without compression? With compression, you need less bandwidth. So do you reduce the speed, or do you, for example, reduce the number of lanes? Typically, the latter is a better solution. Instead of running at lower speed, you run at the full speed but you reduce the number of lanes. That not only saves power, but it also saves pins.”
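Takla's lane-count tradeoff reduces to simple bandwidth arithmetic. The lane rate and compression ratio below are assumptions for the sketch, not figures from Mixel:

```python
# Illustrative lane-count arithmetic: with compression, keep the full
# per-lane speed and drop lanes instead of slowing down. The lane rate
# and compression ratio are assumed values.

LANE_RATE_GBPS = 2.5        # per-lane rate, assumed
LANES_UNCOMPRESSED = 4
COMPRESSION_RATIO = 2.0     # e.g. 2:1 compression, assumed

required_uncompressed = LANES_UNCOMPRESSED * LANE_RATE_GBPS
required_compressed = required_uncompressed / COMPRESSION_RATIO

# Keep lanes at full speed and shed the lanes the payload no longer needs.
lanes_needed = int(required_compressed / LANE_RATE_GBPS)
print(f"payload after compression: {required_compressed} Gb/s")
print(f"lanes at full speed:       {lanes_needed}")
print(f"pins/lanes saved:          {LANES_UNCOMPRESSED - lanes_needed}")
```

Halving the lane count rather than the clock saves both the power of the dropped lanes and their pins, which is the preference Takla describes.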

Others agree. “I’ve seen people making architecture decisions where compression is used to reduce the amount of data you’re transferring. 3X compression effectively means less data being moved around, whether it’s moved between chips or within the chip,” said Alain Legault, vice president of IP products at Hardent. “In both cases, you can get tremendous savings. When you have lots of data processing, even inside the chip, if you have a unified memory architecture and the images are actually compressed, you get tremendous savings. It may mean that you divide the width of the memory architecture by two.”

That also helps with the loading and storing of data. One of the key metrics in Arm’s NPU designs is the number of times that a particular piece of data will get loaded or stored — especially when loaded from main memory and then used. “It’s something we track very, very closely in the simulation and prototyping of our designs, and intelligent movement of that data,” Davies said. “One of the reasons we have numerous little processors inside our other processor designs is precisely to drive that. The actual intelligence of moving that data is crucial to overall energy usage, which is one of the critical factors of an NPU. Why would you use an NPU rather than a CPU? Because it’s more efficient.”

Process more, move less
The general themes here are similar, although the actual approaches can vary greatly. The basic idea is to process more data locally, move it only when necessary, and make that movement as efficient as possible. But this can be augmented with other savings, as well.

“In both storage and memory, the act of moving data all the way over to a processor, such as a CPU, takes a lot of energy,” said Steven Woo, fellow and distinguished inventor at Rambus. He pointed to Mark Horowitz’s keynote presentation at the 2014 International Solid-State Circuits Conference, where Horowitz debunked the prevailing idea that the computation itself was the bottleneck. As Horowitz showed (see fig. 2 below), most of the energy is spent setting up an operation and fetching the data it needs, not performing the arithmetic.

“A ridiculous percentage of the overall power is spent just setting up the computation and moving the data,” Woo said. “What’s more striking is that Horowitz labeled each segment of the bar. He worked through the numbers for getting the instructions and data out of memory.”

Fig. 2: Data access and movement dominate power consumption. Source: Rambus/Mark Horowitz, Stanford University, “Computing’s energy problem (and what we can do about it),” ISSCC 2014.

Still, there are tradeoffs with moving data off-chip. “If I have to go off chip to get my data, one of the things I have to worry about is all the time and energy spent getting my data,” he said. “The same is true for storage. And with storage, the numbers are even worse, because you’re moving even further away to get the data. And if you’re talking about a hard disk, there’s mechanical rotation involved, which takes a lot of energy. From an energy perspective, you might think, ‘If this bar is so dominated by moving data to the compute engine, why don’t I put the compute engine right next to where the data is, maybe even put it into the media, which is storing the data?’ We talked about either near-data processing, or in-memory computing. And that’s just this notion of, ‘I’m going to put the compute just as close as I possibly can.'”
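The disparity Woo describes can be illustrated with order-of-magnitude energy figures of the kind Horowitz presented. The values below are rounded ballpark assumptions, not numbers quoted from the ISSCC slides:

```python
# Order-of-magnitude energy per operation, in picojoules. These are
# rounded ballpark assumptions for illustration, not quoted figures.

ENERGY_PJ = {
    "32-bit add":       0.1,
    "32-bit SRAM read": 5.0,
    "32-bit DRAM read": 640.0,
}

add = ENERGY_PJ["32-bit add"]
for op, pj in ENERGY_PJ.items():
    print(f"{op:16s} {pj:7.1f} pJ  ({pj / add:,.0f}x the add itself)")
```

If fetching an operand from off-chip memory costs thousands of times the arithmetic it feeds, moving the compute to the data rather than the data to the compute is the obvious lever.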

Since then, processors and memories have become smarter. Previously, when a processor asked for data, neither the memory nor the disk recognized it as anything other than bits, and it was the job of the CPU to interpret those bits.

“If you’re going to do the compute at the storage, or at the memory, things change now,” Woo said. “The storage and the memory have to have some idea of what those bits need in order to process them. If you’re thinking about searching, are those bits integers? Are they images? What exactly are those bits? They have to become semantically aware — basically, they have to understand the semantics behind the bits so they can do the processing. It gets a little harder. And it requires some interaction and some agreement between the host CPU and the compute that’s happening near or in the storage or the memory. It’s a change in the model. But I don’t think anybody disagrees that you can get some tremendous performance and energy gains, if this model makes sense for the type of work you’re trying to do.”
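A minimal sketch of that semantic awareness: the host shares a record layout with the drive, so an in-storage engine can evaluate a predicate itself. The record format and the filter function are invented for illustration, not part of any real drive API:

```python
# Hypothetical sketch of "semantic awareness": the drive knows the record
# layout, so it can evaluate a predicate itself. The schema and function
# are invented for illustration, not a real storage API.

import struct

RECORD_FMT = "<if"   # int32 id, float32 value; schema agreed with the host

def drive_side_filter(raw_bytes, min_value):
    """What an in-storage engine could run: decode records, keep matches."""
    size = struct.calcsize(RECORD_FMT)
    out = []
    for off in range(0, len(raw_bytes), size):
        rec_id, value = struct.unpack_from(RECORD_FMT, raw_bytes, off)
        if value >= min_value:
            out.append((rec_id, value))
    return out

# Host writes records; only the filter's matches would cross the bus.
blob = b"".join(struct.pack(RECORD_FMT, i, float(i % 10)) for i in range(100))
matches = drive_side_filter(blob, min_value=8.0)
print(f"records on media: 100, records returned to host: {len(matches)}")
```

The agreement Woo describes is exactly the shared `RECORD_FMT`: without it, the drive sees only opaque bits and must ship everything to the CPU.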

What’s next
As chip architects evaluate where to put the compute, Cadence’s Wall suggests they consider the time budget they have in order to do the necessary computing, as well as the power budget. Can they reduce the time required while staying within that power budget?

“If you had a scalable machine learning architecture, you could look at different points and evaluate solutions,” Wall said. “You could start with the lower end of the scalable solution, at lower performance and lower power, and see if that meets your computational needs. We provide tools like instruction set simulators and SystemC modeling that allow our customers to do that evaluation, and they can easily see if a solution meets their budget. Maybe a particular solution meets their budget with so much to spare that they want to consider scaling back even more. Or they may find out it doesn’t meet their budget, and they need to go up to the next level of the performance curve.”

Others agree. “For each architectural implementation, the chip/system architect will also need to carefully assess the class and performance of the processor(s) involved,” said Rich Collins, Synopsys product marketing manager. “It is not likely a ‘one-size-fits-all’ decision. Each computational storage implementation brings a different set of processor performance, power and area considerations that need to be addressed.”

At the end of the day, all of these limit-pushing considerations must come into play and go into the very foundation of the design. This is where designers create their own ‘secret sauce’ in product development.

“Moving that data around, caching it at the hierarchy of caches, how that impacts the system side – that’s all really important,” said Arm’s Segars. “This is where architecture and design philosophy go beyond just the instruction set and into the structure of the devices that people are building.”
