Lowering Energy Per Bit

Why reining in energy requires changes across the entire semiconductor ecosystem.


Energy is emerging as a focal point in chip and system design, but energy-related issues need to be dealt with on a much broader scale than design teams typically see.

Energy is the amount of power consumed over a period of time to perform a given task, but reducing energy is very different from reducing power. It affects everything from operational costs and system performance to architectural floor-planning, verification, and overall system reliability. In data centers, energy concerns are driving changes in where data is processed, how and where it moves, and how long it spends at each step. In advanced packages, it determines everything from layout to thermal management. And in vertical markets like automotive and IoT, it determines how long a device can run on a single battery charge.

But improving energy efficiency requires rethinking many of the processes that use chips, as well as the chips themselves. “In the case of a big data analytics application, you’re often searching for a needle in a haystack,” said Steven Woo, fellow and distinguished inventor at Rambus. “I’ve got disks and disks full of data, and I may only need one or two pieces of it. But if I have to search through all of those in the conventional way, you take everything in the disk, transfer it to a CPU, then the CPU will search through everything and throw away 99.999%.”

This means a lot of the work being done is actually wasted, which translates to wasted energy. And it’s giving rise to alternative methods of data processing and data storage.

“As a result, what people do with their array of disks is transfer all the data in parallel so it comes faster, but in the end they still have this one CPU, which is the bottleneck, searching for the data,” Woo said. “Another approach, after they recognize that that takes a lot of time and energy, is to allow each of the disks to have a little bit of smarts in them so they can then search in parallel, and only send back the data that matches a particular request. This means bandwidth and energy aren’t wasted moving data that is never going to be used. It all stays local, and I only see the data which matches my criteria.”
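The difference between the two approaches Woo describes can be sketched in a few lines of Python. This is only an illustration: the `Disk` class, the function names, and the byte counting are hypothetical stand-ins, not any vendor's computational-storage API. Both functions return the same matches, but they differ in how many bytes cross the link to the host.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical stand-in for one storage device holding raw records."""
    records: list


def host_side_search(disks, predicate):
    """Conventional path: every record is shipped to the CPU, then inspected."""
    bytes_moved = 0
    matches = []
    for disk in disks:
        for record in disk.records:
            bytes_moved += len(record)  # moved before we know whether it matters
            if predicate(record):
                matches.append(record)
    return matches, bytes_moved


def near_data_search(disks, predicate):
    """Each disk filters locally; only matching records are shipped."""
    bytes_moved = 0
    matches = []
    for disk in disks:
        local_hits = [r for r in disk.records if predicate(r)]
        bytes_moved += sum(len(r) for r in local_hits)
        matches.extend(local_hits)
    return matches, bytes_moved


if __name__ == "__main__":
    disks = [Disk([f"row-{d}-{i}".encode() for i in range(1000)]) for d in range(4)]
    wanted = lambda r: r.endswith(b"-999")
    _, moved_host = host_side_search(disks, wanted)
    _, moved_near = near_data_search(disks, wanted)
    print(f"host-side bytes moved: {moved_host}, near-data bytes moved: {moved_near}")
```

The sketch counts only bytes moved. A real comparison also would have to account for the energy spent by the logic added to each disk.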

There are two key drivers for this. One is that there is more data to process, even though power budgets are flat or shrinking. The second is that more computing is being done in devices connected to a battery, or where there are more analytics to measure energy consumption.

“Energy has increasingly reared its head because it’s a different aspect of looking at a problem that’s becoming more prevalent,” said Rob Knoth, product management director in the Digital & Signoff Group at Cadence. “With a larger amount of intelligence being integrated into the edge, powering it and battery life are becoming more of a prevalent concern. In other areas like hyperscale compute, where the challenges are orders of magnitude worse than the traditional challenges including reliability, power density, and yield, it’s all happening in football-field-size computers. The traditional ways of solving those problems only work to an extent, so it means working smarter rather than harder.”

Or viewed differently, copying data isn’t free. “It’s expensive in terms of time and energy and storage space to make copies of data for different applications,” said Scott Durrant, DesignWare IP solutions marketing manager at Synopsys. “And to the extent that data can be shared by allowing multiple processing devices or applications to access the same memory, you not only can save the energy and the time needed to move that data around, but you also then have just one single source of truth (SSoT). You don’t have multiple copies of data that might get out of sync, where you don’t know which one is the latest or most correct. And so just maintaining a single copy and allowing different devices to access data in that copy, as can happen with a cache coherent type of interface, becomes very valuable.”

These considerations impact architectural choices because the application, as well as the data stream and the type of data, can drive the selection of cores that are integrated into an SoC.

“When we talk about data transfer, we measure power or energy consumption in picojoules per bit, so for every bit that you transfer, a certain amount of energy is required to move it over the wire and through the intermediate devices that it has to traverse,” Durrant said. “And when you talk about energy per bit, as you increase the number of bits that you’re transferring, the energy goes up. We’re finding that with some of these very high data rates using traditional technologies, the amount of energy needed to transfer terabytes of data just becomes untenable. It creates a tremendous amount of heat, and it’s expensive. Data center operators have to pay for that energy. It’s not considered environmentally friendly and so there are a number of reasons why keeping the energy under control is important. This is why the architects of these devices, who implement the protocols, are looking very hard to find ways to transmit data at lower energy per bit.”

Another approach is to reduce the amount of data that actually has to be moved. “If you have data stored in one location and you copy that into memory five different times, you’ve used a lot of energy to make those copies,” Durrant said. “If you can copy it into memory one time, or better yet, even just leave it where it is and process it in place, then you can save a tremendous amount of energy.”
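The arithmetic behind picojoules per bit is simple enough to sketch. The helper names below are made up for illustration, and the 5 pJ/bit figure in the example is an assumed placeholder rather than a number from any interface specification.

```python
PICO = 1e-12  # one picojoule expressed in joules


def transfer_energy_joules(num_bytes, pj_per_bit, copies=1):
    """Energy to move `num_bytes` across a link `copies` times."""
    return num_bytes * 8 * pj_per_bit * PICO * copies


def link_power_watts(bits_per_second, pj_per_bit):
    """Sustained power drawn by a link running at a given data rate."""
    return bits_per_second * pj_per_bit * PICO


if __name__ == "__main__":
    one_terabyte = 1e12   # bytes
    assumed_cost = 5.0    # pJ/bit, hypothetical value for illustration only
    once = transfer_energy_joules(one_terabyte, assumed_cost)
    five = transfer_energy_joules(one_terabyte, assumed_cost, copies=5)
    print(f"one copy: {once:.1f} J, five copies: {five:.1f} J")
    print(f"1 Tb/s link: {link_power_watts(1e12, assumed_cost):.1f} W")
```

Because energy scales linearly with both the bit count and the number of copies, there are only two levers: lower the cost per bit, or move fewer bits.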

Moving data is expensive in terms of energy and time. The less data that can be moved, the better.

“Once you have the ability to only send back the information that really matters, maybe you can even do some simple compute in the storage, as well,” said Rambus’ Woo. “And once you have that ability to send back a smaller amount of much more meaningful data to the CPU, then the CPU is going to try and hold on to it as long as it can. It can perform techniques like weight stationary, where it just holds the data and tries not to move it. What you’re hoping in all of this is to minimize the movement of data at the disk.”

Thinking bigger
Minimizing data movement now becomes a critical factor in architectural choices for an entire SoC or system.

“If we just try to push the RTL-to-GDSII-implementation-signoff flow harder, we will be successful,” Knoth said. “We might make a 5% to 10% gain on one design-specific thing, and we would all pat ourselves on the back and say we’re doing a great job. But that’s not going to achieve the gains that are being asked of us as an industry. We can’t be satisfied with 5% gain here, or 2% gain there, or 10% gain there. We’ve got to be looking for the 50% gain, the 2X gain, whether that’s from a productivity standpoint or from an end power/performance/area/energy gain. We’ve got to be looking for those, if we want to achieve the goal that we all have in front of us in the industry.”

This is where energy starts coming into focus. “If you start talking about energy, you’re talking about actual work that a system is doing rather than just talking about it in a moment in time, which is a static event,” Knoth said. “When you talk about the work that something’s doing, you naturally start talking about its architecture. That’s where the big wins happen, but traditionally, that’s been on the other side of the wall. And when I’m talking about RTL-to-GDSII, by the time I’m talking about RTL that’s written, a lot of decisions have been made that have potentially forced you into a local minimum. If you want to make a 2X gain, you can’t be satisfied with that. You’ve got to be able to go back and say, ‘What is this whole solution space I’m looking at? Could I have doubled the clock rate and halved the bus width, and achieved lower overall total power and energy with that?’ You’re not going to run that experiment at RTL, and that’s not something an RTL-to-GDSII platform is going to do. You’ve got to be looking at the architecture at the SystemC level, at the MATLAB level.”
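The kind of experiment Knoth describes, trading clock rate against bus width, can be mocked up long before RTL exists. The sketch below is a toy model, not a tool flow. It assumes dynamic power follows the textbook relation of activity × capacitance × V² × frequency, adds a per-lane leakage term, and takes the supply voltage needed at each frequency from a caller-supplied function. Every parameter name and value is hypothetical.

```python
from itertools import product


def energy_per_task(freq_hz, width_bits, total_bits, *, cap_per_lane_f,
                    leak_per_lane_w, vdd_for_freq, activity=0.5):
    """Energy (J) to move `total_bits` over a bus of `width_bits` lanes at `freq_hz`."""
    vdd = vdd_for_freq(freq_hz)                      # assumed voltage/frequency curve
    runtime_s = total_bits / (freq_hz * width_bits)  # time to finish the task
    dynamic_w = activity * cap_per_lane_f * width_bits * vdd ** 2 * freq_hz
    leakage_w = leak_per_lane_w * width_bits
    return (dynamic_w + leakage_w) * runtime_s


def rank_configs(freqs, widths, min_throughput_bps, total_bits, **model):
    """Return (energy, freq, width) tuples meeting the throughput target, lowest energy first."""
    candidates = []
    for freq, width in product(freqs, widths):
        if freq * width < min_throughput_bps:
            continue  # this point cannot sustain the required data rate
        candidates.append((energy_per_task(freq, width, total_bits, **model), freq, width))
    return sorted(candidates)


if __name__ == "__main__":
    ranked = rank_configs(
        freqs=[1e9, 2e9], widths=[32, 64],
        min_throughput_bps=64e9, total_bits=1e12,
        cap_per_lane_f=1e-12, leak_per_lane_w=1e-3,
        vdd_for_freq=lambda f: 0.8 if f <= 1e9 else 0.9,  # made-up curve
    )
    for energy, freq, width in ranked:
        print(f"{freq / 1e9:.0f} GHz x {width} bits: {energy:.3f} J")
```

In this model a faster, narrower bus spends less energy on leakage because the task finishes sooner, but it spends more on switching if the higher clock needs a higher voltage. Which effect wins depends entirely on the assumed numbers, which is why the question has to be asked at the architecture level.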

This is not only about a particular tool. For architects and designers, it requires changing how they account for the work they know their device is going to do, and thinking about that work at a higher level.

Knoth noted this shift coincides with the evolution of computational software and EDA. “Think about where our industry has come from, and the walls we’ve already broken down. We’re good at bringing people to the table together, and getting them to form connections. It’s getting people who are upstream in a tool chain to use the same engines, the same algorithms, the same tools that someone downstream is using so they can predict better, so that they can make better decisions earlier. They need to be able to make architecture-level decisions and not have it be a surprise. It’s not just, ‘I’m going to make a new tool, go sell that new tool, and pat myself on the back because I won.’ It’s about creating new conversations between people who weren’t talking in the first place. It’s about sharing technologies that in many cases already exist, but they’re not being used by the people who can have the biggest impact.”

Back to fundamentals
The ultimate objective for the RTL engineer or the architect is to make energy-efficient systems.

“Fundamentally, higher energy means more power, means more heat dissipation,” said Qazi Faheem Ahmed, principal product manager at Siemens EDA. “It means lesser reliability or increased cooling costs, etc. All of these concerns are getting bigger. On the top of that, exascale computing is creating lots of problems for energy consumption in CPUs, GPUs, and analog chips. Even IoT devices are supposed to last a very long time on small batteries. We know that energy is power integrated over time. Let’s say there’s a job that dissipates X amount of power for the duration. The only way to cut down on the energy dissipation is either to do that faster with the same power, or take more time but consume less power, or take the same amount of time and consume less power.”
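Ahmed's three options follow directly from the definition of energy as power integrated over time. As a worked sketch, assume a job that draws constant power P for a duration T. The scaling factors k, m, and n (all greater than 1) are illustrative symbols, not quantities from the article.

```latex
\begin{align*}
E &= \int_{0}^{T} P(t)\,dt = P\,T
  && \text{(constant power)} \\
E_{\text{faster}} &= P \cdot \frac{T}{k} = \frac{E}{k}
  && \text{(same power, $k$ times faster)} \\
E_{\text{slower}} &= \frac{P}{m} \cdot nT = \frac{n}{m}\,E
  && \text{(less power, more time; a saving only if $m > n$)} \\
E_{\text{same time}} &= \frac{P}{m} \cdot T = \frac{E}{m}
  && \text{(less power, same time)}
\end{align*}
```

The middle case is the one that can backfire. Cutting power saves energy only if power falls by a larger factor than the runtime grows.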

Those relationships change significantly, however, in the context of a larger system. “How you reduce power ultimately results in how to reduce energy,” Ahmed said. “But looking at the bigger picture in terms of a system that consists of a number of CPU and GPU cores, how would that play out, and how do you know where you have problems with taking an extra amount of energy that is really not resulting in throughput? It’s understood that the more the throughput, the higher the power consumption would be. But how can you control power without compromising on the throughput? That’s where it becomes interesting. We have customers who are asking how to manage the energy consumption of an SoC, or even at IP level, how to assess which operations, or which part of the RTL code, is dissipating more power or consuming more energy than it should? It means they’ll have to try out different kinds of workloads or scenarios, and for each scenario there’ll be a certain power number or an energy consumption number, and there is lots to understand about where to look to optimize for energy. That’s where things stand today, and software tools are not equipped to handle and provide that kind of a methodology to dive deep into the RTL code, the workload, and give the user an exact picture where the energy is being wasted.”

Fig. 1: Energy proportionality plot. Energy should rise with increased throughput. In addition, different workloads need to be analyzed on energy consumption vs. throughput, and regions of disproportionate energy consumption should be the focus of power optimization. Source: Siemens EDA
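One simple way to look for the disproportionate regions the figure refers to is to compare energy per unit of throughput across workloads. The sketch below assumes that measured (workload, throughput, energy) samples already are available, and it uses the median ratio as a baseline. The function name, the 1.25× tolerance, and the sample values are hypothetical choices for illustration, not the method of any particular tool.

```python
from statistics import median


def flag_disproportionate(samples, tolerance=1.25):
    """Return names of workloads whose energy per unit of throughput
    exceeds `tolerance` times the median ratio across all workloads.

    samples: iterable of (name, throughput, energy) tuples in consistent units.
    """
    ratios = {
        name: energy / throughput
        for name, throughput, energy in samples
        if throughput > 0  # skip idle points where the ratio is undefined
    }
    baseline = median(ratios.values())
    return sorted(name for name, ratio in ratios.items() if ratio > tolerance * baseline)


if __name__ == "__main__":
    measured = [            # made-up numbers for illustration
        ("idle_loop", 10, 10),
        ("fft", 100, 100),
        ("memcpy", 50, 50),
        ("dma_burst", 20, 60),
    ]
    print(flag_disproportionate(measured))
```

The flagged workloads are candidates for a closer look, not a verdict. A higher ratio may be inherent to the operation rather than a sign of wasted energy.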

This has broad implications for a number of industry segments. Consider the electrification of cars, for example. Electric vehicles require embedded systems with high performance and low power consumption to maximize the miles per charge. To achieve this, automotive OEMs and Tier 1s are transitioning to advanced process nodes to optimize performance at low power.

“One of the design considerations for a memory subsystem is the energy consumption for code execution from an external DRAM, which can get 2X higher at extreme temperatures above 125°C,” said Sandeep Krishnegowda, senior director of marketing and applications at Infineon Technologies Americas. “One of the ways to reduce this energy consumption is to execute in place (XiP) from a high-performance external NOR flash to provide lower energy per bit and reduce system cost compared to DRAM in such applications.”

Thermal impacts energy
There is a thermal aspect to energy, as well. With smaller geometries, and more designs being assembled into chiplets and 3D stacks, in-package thermal aspects need to be analyzed, verified, and optimized. In the case of advanced nodes, problems stem from thinner wires — it takes more energy to drive signals through thinner wires over longer distances — and logic density, where more processing in a given area generates more heat. In the case of advanced packaging, the problem is related to heat dissipation, which is particularly problematic with logic next to logic in 3D-ICs.

All of this makes thermal analysis more difficult because there are different time constants between electrical timing and thermal timing. “Thermal energy spreads slowly, on the order of seconds, while other power spreads in nanoseconds or microseconds,” said Marc Swinnen, director of product marketing at Ansys. “It’s a very different time constant. When you look at the instantaneous power, the power use is not that important for thermal, but the energy is. If it’s just a little blip that uses a lot of power very briefly, there’s not a lot of energy that’s going in there. But if a lot of these blips are occurring, on average, then a lot of energy is being used. Standing back and looking five layers up to see the heat that is finally soaking up to the heatsink — that is more about the energy being used than the power.”

With through-silicon vias running through thinner chips in a stack, there also are more local power effects to deal with. “Even though this only lasts a microsecond, it can cause a local thermal peak,” Swinnen said. “We have a customer test case, for example, where they have thousands of bumps and micro bumps that allow the current between the two chips to communicate, and there is a simulation that shows some of these bumps carried intense currents briefly. They heat up, but then they cool down again. Then the next surge comes on, it heats up again, but the circuits haven’t cooled back down to where they were originally. Each time it heats up, they don’t cool down as much. They heat up some more, and finally, after approximately 14 seconds, the bump melts. This sort of thing is unusual. People are not used to it, especially coming from the timing side. The power is okay in each case, but the total energy being dissipated in that device over time is causing it to melt.”
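The ratcheting effect Swinnen describes can be reproduced with a first-order thermal model, in which a node heats toward ambient plus power × thermal resistance and relaxes with a single time constant. This is a deliberately crude sketch. Real bumps need a full thermal solver, and every parameter name and value below is an assumption chosen only to make the effect visible.

```python
def simulate_pulsed_heating(pulse_w, pulse_s, period_s, n_pulses, *,
                            r_th_k_per_w, tau_s, t_amb_c=25.0,
                            steps_per_period=1000):
    """Forward-Euler integration of dT/dt = (P*R_th - (T - T_amb)) / tau.

    Returns the peak temperature (deg C) reached during each pulse period.
    """
    dt = period_s / steps_per_period
    pulse_steps = round(pulse_s / dt)
    temp = t_amb_c
    peaks = []
    for _ in range(n_pulses):
        peak = temp
        for step in range(steps_per_period):
            power = pulse_w if step < pulse_steps else 0.0  # brief surge, then idle
            temp += dt * (power * r_th_k_per_w - (temp - t_amb_c)) / tau_s
            peak = max(peak, temp)
        peaks.append(peak)
    return peaks


if __name__ == "__main__":
    peaks = simulate_pulsed_heating(
        pulse_w=2.0, pulse_s=1e-3, period_s=5e-3, n_pulses=10,
        r_th_k_per_w=50.0, tau_s=0.1,  # hypothetical bump parameters
    )
    for i, peak in enumerate(peaks, start=1):
        print(f"pulse {i}: peak {peak:.2f} C")
```

Because the surges repeat faster than the thermal time constant, each one starts from a slightly hotter baseline than the last, so the peak keeps climbing even though every individual pulse is identical.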

Javier DeLaCruz, senior director of system integration at Arm, points to similar problems. “Designs assembled from chiplets introduce new elements to consider, assuming we are talking about 2.xD configurations, such as different maximum junction requirements. In this example, devices in the same package may need to be more thermally decoupled. Advancements with heat spreaders with embedded heatpipes are one effective way to mitigate the differing thermal needs of each of the chiplets. The chiplets do need to be modeled individually to determine their heat generation signatures. Then these chiplets need to be considered as part of the packaged system, including the coupling of these heat sources, which impacts each chiplet, and which then may have impacts on timing and other performance metrics.”

When the packaged chip is housed in a system, there are other thermal considerations and cooling requirements. “The complexities vary between handset systems and high-performance compute systems considerably, but each has a limit to what the system can handle, and it may be the system that has the limitation,” DeLaCruz said. “Often, packaged parts are designed and thermally simulated with an abstract set of assumptions that may not represent the complete system enclosure. This can pose a challenge at the system enclosure, as there can be many parts vying for the same thermal dissipation path.”

Also, in cases where it’s possible to abstract states of energy consumption based on software workloads, the software running on those designs requires more accurate models for dynamic power consumption and thermal effects. “There are several tools in the toolbox for thermal management. One tool is to leverage the thermal mass of the packaged system to allow for adequate warning of a thermal event. It takes many clock cycles for a part to heat up, which can be observed and compensated for within software,” DeLaCruz added.
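Because heating is slow relative to the clock, even a simple software policy has time to react. The sketch below shows one common pattern, a hysteresis-based step-down of a performance level. The function name, thresholds, and the idea of discrete performance levels are assumptions for illustration and are not drawn from any specific platform's thermal framework.

```python
def next_perf_level(temp_c, level, *, warn_c, clear_c, max_level):
    """Pick the next performance level from a temperature reading.

    Steps down one level at or above `warn_c`, steps back up one level at or
    below `clear_c`, and holds in between so the policy does not oscillate.
    """
    if temp_c >= warn_c and level > 0:
        return level - 1
    if temp_c <= clear_c and level < max_level:
        return level + 1
    return level


if __name__ == "__main__":
    level = 3
    for reading in [70, 85, 96, 97, 90, 78, 75]:  # made-up sensor trace, deg C
        level = next_perf_level(reading, level, warn_c=95, clear_c=80, max_level=3)
        print(f"{reading} C -> level {level}")
```

The gap between the two thresholds is the design choice that matters. It relies on the thermal mass DeLaCruz mentions to give the software time to act before a limit is reached.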

For all energy design concerns, it’s important to remember that power is not a function of one or two things.

“It’s a complicated thing, and it’s impacted by toggle activities,” said Siemens EDA’s Ahmed. “It’s impacted by the technology being used. It’s impacted by the way you design, and all of these different things. There has to be some capability in the tools, and then some expertise on the part of the RTL designers to leverage those capabilities. You have to qualify your workload to be sure that you’re estimating power or energy using the right workload to get a realistic picture, and that you’re using the right techniques to reduce power and focusing on the right places rather than implementing certain techniques that may not help in the bigger picture. Ultimately, analytics can help make sound judgments at the higher level, whether your SoC’s energy efficiency has improved or stayed where it was.”
