More Data Drives Focus On IC Energy Efficiency

Decisions that affect how, when, and where data gets processed.

Computing workloads are becoming increasingly interdependent, raising the complexity level for chip architects as they work out exactly where that computing should be done and how to optimize it for shrinking energy margins.

At a fundamental level, there is now more data to compute and more urgency in getting results. This situation has forced a rethinking of how much data should be moved, when it should be moved, and how much energy is taken up with various functions that are sometimes separate, sometimes dependent on one another, and often prioritized in different ways.

Formulas can vary greatly, depending upon where data is processed. For example, moving data around inside a data center consumes an estimated 10% to 40% of total data center energy, and that percentage is expected to grow as the amount of data that needs to be processed continues to skyrocket.

“There’s been significant growth in data for some time now, and the networking of devices has had a particular impact,” observed Scott Durrant, cloud segment marketing manager for DesignWare IP at Synopsys. “As we put more and more devices on the network, such as video security cameras, traffic cameras, manufacturing control systems, and the like, these devices are driving up network data traffic, and therefore data movement, in very significant ways.”

Every minute of every day, more than 500 hours of video are uploaded to YouTube, and every day more than a billion hours of video are streamed from YouTube. Streaming video and social networking are two of the key drivers of network traffic, Durrant said.

But it’s no longer just a data center issue. The whole edge buildout is happening because it takes too long to get results if all the data is sent to the cloud, and it takes far too much energy to drive signals back and forth. Processing that data closer to the source can have a big impact on energy consumption, performance, battery life in mobile devices, as well as a host of physical effects that can impact everything from circuit aging and signal integrity to the overall competitiveness of end products.

“As compute engines get faster, such as for AI accelerators, they need more bandwidth, and there’s only a couple of really good memory solutions that will work,” said Steven Woo, fellow and distinguished inventor at Rambus. “That means when you do have to move data, you want it to be as power efficient as possible. Once you get the data on your chip, you try not to move it around at all. What some architects will do is keep some of the data stationary, but change the computation that’s going on around the data because it’s easier to do that. It’s better to have those compute resources all in one place and not move the data around than it is to have unique, dedicated resources and ship the data back and forth between little compute units.”


Fig. 1: Data access and movement dominate power consumption. Source: Mark Horowitz, ISSCC 2014/Rambus

In the bar chart above, the small red segment represents the total energy to add two numbers together, while the blue segments represent all the energy needed to fetch the two numbers, move them to the compute engine, and control the add operation. Woo noted, “The interesting part is the blue section of the bar called ‘Register File Access.’ This shows the energy to access the data if that data happens to be on chip in a register (6pJ, or picojoules). In the rightmost table above the bar, you can see how much more energy it takes if the data you’re adding together happens to be stored somewhere else, like a cache (a range of 10pJ to 100pJ depending on the size of the cache). The one that’s a bit surprising is if the data happens to be in DRAM – it’s 1.3 to 2.6 nanojoules (nJ), which is approaching 1,000 times more energy. So if the data happens to be in DRAM, then that second segment of the bar chart would go from 6pJ to ~2nJ, or roughly 333 times its size. The data movement energy would swamp all other energy. That’s why accessing DRAM in an intelligent way is so critical – it can’t be avoided, but once you do get data from DRAM, you need to make sure you reuse it as much as you can to amortize the high energy required to access it.”
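
As a quick back-of-the-envelope check, the ratios follow directly from the numbers quoted above, taking the ~2nJ DRAM access as a rough midpoint of the 1.3nJ to 2.6nJ range:

```python
# Back-of-the-envelope check of the energy figures quoted above
# (register file, cache, and DRAM access energies from the Horowitz/Rambus data).

PJ = 1.0             # work in picojoules
NJ = 1000.0 * PJ

register_access = 6 * PJ                  # on-chip register file access
cache_access    = (10 * PJ, 100 * PJ)     # small to large on-chip cache
dram_access     = (1.3 * NJ, 2.6 * NJ)    # off-chip DRAM access
approx_dram     = 2.0 * NJ                # the "~2nJ" figure Woo rounds to

print(f"cache vs. register: {cache_access[0]/register_access:.1f}x "
      f"to {cache_access[1]/register_access:.1f}x")               # ~1.7x to ~16.7x
print(f"DRAM vs. register:  {dram_access[0]/register_access:.0f}x "
      f"to {dram_access[1]/register_access:.0f}x")                # ~217x to ~433x
print(f"~2nJ vs. register:  {approx_dram/register_access:.0f}x")  # ~333x
```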

Moving data costs energy and takes time, so processing that data in place and getting rid of what isn’t useful can have a big effect on performance and power. “Once you have that ability to send back a smaller amount of much more meaningful data to the CPU, then the CPU is going to try and hold on to it as long as it can, and can perform techniques like weight stationary or something analogous, where it holds the data and tries not to move it,” he said. “What you’re hoping in all of this is to minimize the movement of data at the disk. Once you get something more meaningful, you can send that back to the processor so there’s no wasted bandwidth because you’re only sending the meaningful stuff, and then the processor will hold on to it as long as it possibly can and try not to move it around. All of that is designed to minimize data movement.”
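
The weight-stationary approach Woo mentions keeps one operand resident in the compute unit and streams the other past it. The sketch below is a minimal, purely illustrative Python version of that idea, not any vendor's implementation; the function name and tile shapes are hypothetical.

```python
import numpy as np

# Minimal sketch of a weight-stationary dataflow: the weights are loaded into
# the compute unit's local storage once and stay put, while input tiles
# stream past them. Names and shapes are illustrative assumptions.

def weight_stationary_matmul(weights, input_tiles):
    """weights: (M, K) array held locally; input_tiles yields (K, T) tiles."""
    local_weights = weights                    # loaded once, then kept stationary
    outputs = []
    for tile in input_tiles:                   # activations stream through the engine
        outputs.append(local_weights @ tile)   # the computation changes, not the data's location
    return np.concatenate(outputs, axis=1)

# One weight load amortized over many activation tiles.
W = np.random.rand(8, 16)
tiles = (np.random.rand(16, 4) for _ in range(10))
print(weight_stationary_matmul(W, tiles).shape)   # (8, 40)
```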

This is a big issue for automotive applications, for example, because of the volume of data generated by sensors. In an ADAS system, a lot of sensor inputs must be accounted for, and some of that includes streaming video that needs to be processed for things like object detection and classification. A lot of data has to go in and out very quickly.

“Moving that data takes up a lot of power between the processor having to issue the directive, the instruction to move the data, as well as the toggling of the interconnect lines,” explained George Wall, director of product marketing for Tensilica Xtensa Processor IP at Cadence. “Plus, there’s always some overhead in terms of the processor having to wait for that data. But that data is being transferred, so in a way it’s almost like the energy that’s being used isn’t really being put to good use. There’s not a lot of processing happening when the data is being moved. It’s just waiting for the data.”

These issues were manageable in the past when data volumes were smaller, but they have been steadily growing as compute-intensive applications such as AI and data analytics become more pervasive.

“The problem has existed for quite some time, but now we are seeing some solutions come to light,” according to Anoop Saha, head of strategy and growth at Siemens EDA. “The key questions are how it impacts the design of a chip, and how it impacts the way the system architect is thinking about going through that process. If you look at it from the SoC and system architect point of view, the data is captured from one source. For example, in-car data is coming from cameras and other sensors. In a smartphone, data is coming from the Internet or something that the user is doing. Once data is captured, it goes from that place to a storage system, and then data is computed on the compute engine. In this case, the data is moving across multiple stages. There is the energy spent on moving the data from the source of the data to the storage, and then from the storage to the multiple layers of memory that you have in the chip even before you do the actual compute in that unit. The energy efficiency is very different when it comes to moving the data based on how much data is moved from the on-chip memory — L1 cache to the compute unit, L2 cache to the compute unit — versus how much data is moved from the off-chip DRAM to on-chip compute. There is an order of magnitude difference in that, but you cannot have everything in one chip. At the same time, on-chip memory has limitations. It’s expensive, and you cannot have DRAM in the SoC. You have to balance between how much you can store off-chip, how much you can store in the different levels of cache, and how to do the compute.”
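
One rough way to see the order-of-magnitude effect Saha describes is to model total fetch energy as accesses per level multiplied by an assumed energy per access. The per-access figures below are illustrative assumptions, not measured values:

```python
# Rough model of how data placement across the memory hierarchy changes the
# energy spent per workload. The per-access energies are assumed,
# order-of-magnitude values for illustration only.

ENERGY_PJ = {"L1": 10, "L2": 50, "DRAM": 2000}   # assumed pJ per access

def fetch_energy(accesses):
    """accesses: dict of level -> number of operand fetches served there."""
    return sum(ENERGY_PJ[level] * count for level, count in accesses.items())

# The same 1,000 operand fetches, with two different working-set placements:
mostly_on_chip  = {"L1": 900, "L2": 90, "DRAM": 10}
mostly_off_chip = {"L1": 100, "L2": 100, "DRAM": 800}

print(fetch_energy(mostly_on_chip), "pJ")    # 33,500 pJ
print(fetch_energy(mostly_off_chip), "pJ")   # 1,606,000 pJ, roughly 48x more
```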

The balancing act with these tradeoffs is compounded by memory considerations. “In the next wave of innovation, memory demand is being driven by big data applications across all end market segments — automotive, 5G, AI,” said Anand Thiruvengadam, director of product marketing in the Custom Design and Verification Group at Synopsys. “These are the primary drivers. A corollary is the emergence of big compute. There is big data. What do you do with it? You have to compute, so you need big compute. These applications have inspired the emergence of newer architectures on the compute side. For example, not just innovations on CPU architecture, but now GPU, which is now a mainstay for acceleration. Also, the emergence of processors like the DPU for data center specific compute.”

That push toward customization pays significant dividends in terms of power and performance. “In the AI and ML, and even the HPC markets, the advantages of having customized hardware to run your particular algorithms or your particular architecture are so great that they actually don’t want to use standard off-the-shelf hardware if they don’t have to,” said Marc Swinnen, director of semiconductor product marketing at Ansys. “So in those markets, which are big markets driving a lot of semiconductor design these days, they still want customized hardware architectures.”

That, in turn, opens the door for different memory options. “As big compute rises up to the challenge, it points to the demand for memory,” Thiruvengadam said. “With big data applications, as more data has to be crunched, traditional Von Neumann architectures are the bottleneck. You now have to move a lot of data back and forth between memory and compute, and so the energy that’s spent on it is becoming a limiting factor. What has accelerated this is the fact that Moore’s Law scaling has slowed down, and the shift to the next node is not only more expensive, but it is not as power-efficient. All of this is leading to the emergence of new compute architectures, and memory architectures such as at-memory compute and in-memory compute, which essentially have become mainstream primarily for AI.”

Changing where processing is done
On a macro scale, concern over energy and latency has created a huge opportunity to do more computing at the edge using specialized architectures.

“Instead of moving a lot of data long distances, you try to effectively partition the processing, do enough processing at the edge, so the data you send is much more optimized for what you really need to send,” said Ashraf Takla, CEO of Mixel.
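
One minimal, hypothetical illustration of that partitioning: run a cheap first-pass detector on the edge device and transmit only the frames (or summaries) that clear a threshold, rather than streaming every raw frame upstream. The detector and threshold below are placeholders, not a reference design.

```python
import random

# Illustrative edge-side filter: score each frame with a lightweight
# on-device model and transmit only the few frames worth sending.
# The detector and threshold are hypothetical placeholders.

def edge_filter(frames, detect, threshold=0.98):
    """Yield only frames whose detection score clears the threshold."""
    for frame in frames:
        score = detect(frame)                       # cheap local inference
        if score >= threshold:
            yield {"score": score, "frame": frame}  # only meaningful data moves

# Example: 1,000 raw frames, roughly 2% of which clear the threshold.
frames = range(1000)
sent = list(edge_filter(frames, detect=lambda f: random.random()))
print(f"transmitted {len(sent)} of 1000 frames")
```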

But even at the edge, changes are needed. “Shoveling data on and off chip is very power hungry,” said Simon Segars, CEO of Arm. “Moving it from main memory onto the chip is very costly from an energy point of view.”

The challenge is figuring out a way to bring data into the chip as quickly as possible, use whatever is necessary before sending it on, and then replace it with other data. This is particularly difficult with graphics and machine learning, because both require very large datasets.

“Concentrating on the compute versus load store energy use is something we focused on from the very beginning,” said Jem Davies, vice president and general manager for machine learning at Arm. “The fact that we use compression in several different places in the GPUs, and the way we’ve moved on to do that with our neural network processors, is a reflection of that. We can compress it and decompress for less energy than it would take to load and store it uncompressed. It affects our cache architecture, as well. In terms of the loading and storing, one of the key metrics in our NPU designs is the number of times that a particular piece of data will get loaded, or indeed stored — but very particularly loaded from main memory into the chip. It’s something we track closely in simulation and prototyping, along with intelligent movement of that data. The numerous little processors inside our other processor designs are there precisely to drive that. The actual intelligence of moving that data is crucial to overall energy usage, which is one of the critical factors of an NPU. Why would you use an NPU rather than a CPU? Because it’s more efficient.”
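
The energy argument for compression reduces to a simple inequality: the compress-plus-decompress cost must be lower than the energy saved by moving fewer bytes to and from main memory. The check below uses assumed, illustrative per-byte energies, not Arm's figures:

```python
# Sketch of the compression tradeoff: worth doing when codec energy is less
# than the energy saved by moving fewer bytes. All figures are assumptions.

def compression_wins(bytes_raw, ratio, pj_per_byte_dram, pj_per_byte_codec):
    """Return True if compressing before store/load saves energy overall."""
    moved_raw        = 2 * bytes_raw                     # store + load, uncompressed
    moved_compressed = 2 * bytes_raw / ratio             # store + load, compressed
    codec_cost       = 2 * bytes_raw * pj_per_byte_codec # compress + decompress
    return moved_compressed * pj_per_byte_dram + codec_cost < moved_raw * pj_per_byte_dram

# Hypothetical numbers: 2:1 compression, DRAM traffic at 20 pJ/byte,
# codec at 2 pJ/byte -> compression saves energy overall.
print(compression_wins(1 << 20, ratio=2.0, pj_per_byte_dram=20.0, pj_per_byte_codec=2.0))
```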

Segars said moving that data around, caching it in the hierarchy of caches, and understanding how that impacts the system side are all critical. “This is where architecture and design philosophy goes beyond just the instruction set and into the structure of the devices that the people are building. 3D stacking, and the high bandwidth that you can get interfacing between memory or in fact different processes on different stacks in a 3D die is going to be a super important piece of technology for keeping up with delivering more and more performance with each process generation. People talk about the end of Moore’s Law already in the transistor scaling, but for me, that move into three dimensions is going to be one of the areas that helps performance just keep improving generation after generation.”

New tradeoffs
With so many existing and new challenges to consider, making the proper tradeoffs is critical.

“To understand the tradeoffs, we need to look at different system implementations,” said Ramesh Chettuvetty, senior director of marketing and applications at Infineon Technologies. “In the case of a distributed cloud computing system, the compute workload is partitioned between the cloud and the edge node, considering several factors including the latency requirements of the application, overall power efficiency, power availability at the edge device, etc. Most of the time, first-level data analytics are handled by the edge device at the source of data in order to save energy and reduce latency resulting from data transfer. However, adding more compute workload on edge devices increases their cost and power consumption. The tradeoffs here are in optimizing the overall system cost, power, and performance. For some applications, considerations like latency and data confidentiality outweigh the cost/power tradeoffs.”

If the edge device is considered a standalone system, power reduction is achieved by optimizing the partitioning and distribution of data storage elements. Data that is frequently accessed gets stored close to the processing unit, while data that is infrequently accessed is stored further away. The closer the data is to the compute engine, the lower the energy lost to data transfer. Therefore, systems typically implement a tiered memory approach, with L1 (frequent data), L2 (less frequent data), and L3 cache elements associated with each processing unit.
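
A toy version of that placement policy, purely as an illustration (tier names and capacities are arbitrary assumptions): rank blocks by access frequency and assign the hottest ones to the tier closest to the compute unit.

```python
from collections import Counter

# Toy placement policy: track how often each block is accessed and keep the
# hottest blocks in the tier closest to the compute unit. Tier capacities
# here are arbitrary, illustrative values.

def place_by_frequency(access_trace, tiers=(("L1", 2), ("L2", 4), ("DRAM", None))):
    """Assign each block to a tier, hottest blocks nearest the compute unit."""
    hot_first = [blk for blk, _ in Counter(access_trace).most_common()]
    placement, cursor = {}, 0
    for tier, capacity in tiers:
        chunk = hot_first[cursor:] if capacity is None else hot_first[cursor:cursor + capacity]
        for blk in chunk:
            placement[blk] = tier
        cursor += len(chunk)
    return placement

trace = [0, 1, 0, 2, 0, 1, 3, 0, 4, 1, 5, 6, 7, 8, 9] * 3
print(place_by_frequency(trace))   # blocks 0 and 1 land in L1; cold blocks fall to DRAM
```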

Embedding more memory in the central processing unit, however, is expensive and isn’t the optimal choice for general-purpose SoCs. Another approach is to embed distributed compute elements in the memory device itself. To take this one step further, innovative approaches like in-memory compute are being explored for several AI compute workloads.

“Most often the tradeoff is in system cost and power efficiency,” Chettuvetty said. “Decisions include answering the questions, ‘How much storage should be embedded in the central SoC? What should the L1/L2/L3 cache partitioning be? Should there be shared storage (with cache coherence) between compute elements? Should we use a memory with distributed compute elements?’ This means general-purpose SoCs need to determine the optimal amount and partitioning of storage across compute elements to best service the applications the SoC goes into. For custom ASICs meant for targeted applications, the partitioning decision is much easier than general-purpose SoCs.”

Custom designs using multiple chips add a whole new aspect to tradeoffs involving power and performance and how the various pieces interact. “This is part of the whole chiplet discussion going on,” said Ansys’ Swinnen. “An MCM (multi-chip module) just uses standard chips that communicate through standard I/O interfaces, which could just as well be mounted on a PCB. But the idea behind chiplets, which has been realized only by vertically integrated companies so far, is that you really lower the power of the inter-chip communication. So instead of using standard I/O drivers, you use a much lower-power, higher-speed protocol that only works across a few millimeters. That’s how they’re trying to address the tradeoffs of moving data around. But unless you build all the elements yourself — the chip itself, all the neighboring chips — so they all work together because you design them to work together, that’s hard to orchestrate across a heterogeneous industry.”
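
The scale of the benefit Swinnen describes can be sketched with assumed energy-per-bit figures: conventional off-package links are typically quoted at several picojoules per bit, while short-reach die-to-die links target well under 1pJ/bit. The exact numbers below are illustrative assumptions, not measurements of any specific interface.

```python
# Back-of-the-envelope comparison of off-package vs. die-to-die link energy,
# using assumed per-bit figures for illustration only.

pcb_link_pj_per_bit   = 5.0    # assumed: standard I/O driver across a board
die_to_die_pj_per_bit = 0.5    # assumed: short-reach chiplet interconnect

bits_moved = 100e9             # 100 Gb transferred between two dies

print(f"PCB-style link:  {bits_moved * pcb_link_pj_per_bit / 1e9:.1f} mJ")   # 500.0 mJ
print(f"Die-to-die link: {bits_moved * die_to_die_pj_per_bit / 1e9:.1f} mJ") # 50.0 mJ
```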

Optimizing compute for specific applications adds other challenges, as well. “On the chip side, it’s an engineering discipline. On the other side are the algorithm experts who understand what the masks are and what they want to do,” said Michael Frank, fellow and system architect at Arteris IP. “TensorFlow [the open-source machine learning platform from Google] was a significant departure from what people were doing initially — especially in machine learning, reducing it to some linear algebra sequence of operations that are very well understood. They’re replicable, so now you can build an engine like this TensorFlow accelerator that Google has. But it’s a multi-disciplinary effort. You need to have a team that has the knowledge on the side of algorithms. You need to have the engineers who know the memory architecture, memory capabilities, and you need to have the silicon engineers that know the process side.”

That complicates things even further. “These teams need to be paired with the accelerator engineers, who have their algorithm. They would build their little accelerator, try to optimize the memory flow, but never intrude into the memory domain,” Frank said. “It was the same in the other direction. The problem is these typical sequential algorithms that we had in the good old days of linear algebra, graphics, and computing in general processes, were not really well-suited for integration with memory. If you look at things like memory access patterns, the access patterns very often were driven by the sequential nature of a processor CPU, exercising instruction by instruction. This kind of compute paradigm drove the way memory was supposed to deliver data. It drove the whole way people thought about building the hierarchy.”

Big picture, what chip design teams really need to know about making their tradeoffs is similar to buying a high-priced commodity. “You need to know the prices, and what’s available,” he said. “That’s the elementary part of being an architect. You need to know what is possible. You need to know what the technology can deliver. At the same time, you have to be willing to use unconventional things. Sometimes you have to say, ‘Yes I know this may be a stupid idea, but let’s have a look at it.’ This is where the engineering part comes into it. While you need to know what is available, you also need to have a way to define your cost function, and you need to know what you can afford. Then, you must have a concept of what you want. The other side, especially in computation, is that you need to have a decent base of benchmarks you can run with. Part of architecting is innovation and inspiration. A lot of it is knowing what is out there, to know the history, otherwise you have to repeat it. Next, be innovative and creative, which sometimes means you have to think out of the box.”

Decisions are required at every level, Siemens EDA’s Saha said. “There are decisions at the SoC architect level on how to create the architecture, and how to play with the microarchitecture. There are also decisions for the verification engineer, because now they have to verify that all of the architecture changes actually improve the performance and energy efficiency. More architects are talking more about energy efficiency now than just raw PPA, or performance or throughput or latency. It’s all of these considered together. This is driven in part by AI, along with the huge investments into networking, storage and compute, and how to process the data. This impacts how you design things, how early you measure things. You cannot be too late in measuring things. You have to do it early. How easily can you experiment with an architecture? How do you measure energy? And how do you measure the energy aspects of everything? Once you have all of these pieces together, then you make your real architecture decisions on what should be.”

Having a clearly defined set of requirements is critical to understanding what a design team will use as its criteria. “When users have done that, things have tended to be very successful because then everybody has a concrete set of goals and guidelines,” said Cadence’s Wall. “The more of that work the engineering team can do upfront based on its requirements, in terms of defining what the goals are from both a computation and a data throughput point of view, the easier it is.”

Wall noted it is helpful not to get wrapped up in a spreadsheet comparison. “There’s certainly a lot of quantitative criteria that need to be used, but there’s a whole bunch of other, less quantitative criteria that go into a decision, as well. This includes the maturity of the product being looked at, the software tool support around the product, and the reputation of that vendor in the marketplace and among end users. Those are all criteria the engineering team should look at as well.”

Conclusion
Moving data processing from the cloud to the edge is gaining momentum, but that alone isn’t sufficient.

“From a system level, we need to do the same thing,” said Synopsys’ Durrant. “Instead of moving large amounts of data around the infrastructure, let’s move infrastructure closer to the data. In a computational storage approach, for example, scalable processors that are optimized for embedded use — but which can run applications within those devices — can be put very close to the data. You can create a scalable storage infrastructure that has scalable compute, with many levels of scalability in it, to do things like encryption and compression of data, and to do database key-value store offloading and video transcoding. With the huge amount of traffic that’s generated by video, there’s a lot of transcoding going on, and that’s a lot of processing. Also, for the storage device, if you can save another round trip to the host CPU and put a robust and scalable processor within the storage device itself, this can serve in many ways to bring compute closer to the data and reduce the data movement within the infrastructure.”
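
A toy illustration of that computational-storage idea, with hypothetical class and method names: the drive-side processor compresses data on ingest and runs filters in place, so only the results cross the link back to the host.

```python
import zlib

# Toy computational-storage sketch: instead of shipping raw blocks to the
# host and back, the drive-side processor compresses data in place and
# answers queries locally, returning only matching records.

class ComputationalDrive:
    def __init__(self):
        self.blocks = {}

    def write(self, key, data: bytes):
        # Compress on the way in, near the data, sparing a host round trip.
        self.blocks[key] = zlib.compress(data)

    def query(self, key, predicate):
        # Run the filter on the drive; only matching records cross the link.
        records = zlib.decompress(self.blocks[key]).splitlines()
        return [r for r in records if predicate(r)]

drive = ComputationalDrive()
drive.write("log", b"\n".join(b"event %d ok" % i if i % 10 else b"event %d ERROR" % i
                              for i in range(1000)))
errors = drive.query("log", lambda r: b"ERROR" in r)
print(len(errors), "records returned instead of 1000")   # 100 records returned
```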

Still, finding the optimal set of tradeoffs is difficult. Everything changes over time, and with nascent technologies, those changes can be significant.


