Re-architecting Hardware For Energy

The industry is at a turning point. Power has been a second-class citizen when it comes to optimization, but thermal is becoming the great limiter for what will be possible in the future.

A lot of effort has gone into the power optimization of a system based on the RTL created, but that represents a small fraction of the possible power and energy that could be saved. The industry’s desire to move to denser systems is being constrained by heat, so there is an increasing focus on re-architecting systems to reduce the energy consumed per useful function performed.

Making significant progress requires breaking down silos. In many cases it requires teams like hardware and software, digital and analog, or semiconductor architects and packaging coming together to create solutions. No one team can do it all, but it is certainly possible for one team to destroy all the work done by the others.

“Power has been and will continue to be a major limiter,” says Jeff Wilcox, fellow and design engineering group CTO for client SoC architectures at Intel. “Fortunately, we have been able to peel one more layer off that every year and make strides. Thermal density remains a challenge and drives more and more layout constraints, especially for CPUs and areas where we have a high thermal density. We have constraints both in terms of hotspot thermal problems and sustained thermal problems.”

Challenges broaden when looking at new packaging solutions. “Moore’s Law is slowing and performance and power are no longer automatically improved by moving to the next technology node,” says Tim Kogel, principal engineer for virtual prototyping at Synopsys. “In order for multi-die systems to serve as the silver bullet to continue scaling, the power consumption of chiplets will need more attention at the architecture level. Due to the geometrical dependencies of the dies inside a package, it will not be possible to easily ‘fix’ power density issues by increasing the die area and/or adding more power supply vias. The power delivery network of a multi-die system will have to be planned upfront and all components need to adhere to the spec.”

It is no longer a back-end optimization task. "The focus has changed from low power to energy efficiency," says William Ruby, product management director in Synopsys' EDA Group. "You may look at things like, 'Do we run at a particular clock frequency all the time, or do we run much faster and then stop?' While total average power consumption may be similar, energy consumption is different. That's where this is really coming from, and it's driven by different applications. Performance per watt is absolutely critical in data center applications."
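Ruby's point about frequency strategy can be made concrete with a back-of-the-envelope comparison. The sketch below uses entirely hypothetical power and frequency numbers to show how two strategies that complete the same work in the same time window can consume different amounts of energy.

```python
# Illustrative comparison of "run slow and steady" vs. "race to idle" for
# a fixed amount of work. All numbers are made up for illustration only.

WORK = 1e9       # cycles of useful work to complete
DEADLINE = 1.0   # seconds available before the next batch of work

def energy(freq_hz, active_power_w, idle_power_w, deadline_s=DEADLINE):
    """Energy to finish WORK cycles at freq_hz, then idle until the deadline."""
    active_time = WORK / freq_hz
    assert active_time <= deadline_s, "frequency too low to meet the deadline"
    idle_time = deadline_s - active_time
    return active_power_w * active_time + idle_power_w * idle_time

# Slow-and-steady: 1 GHz at 1.0 W, busy for the whole window.
slow = energy(1e9, 1.0, 0.05)

# Race-to-idle: 2 GHz at 2.6 W (power grows super-linearly with frequency
# because voltage must rise too), then drop to a 0.05 W idle state.
race = energy(2e9, 2.6, 0.05)

print(f"slow-and-steady: {slow:.3f} J")
print(f"race-to-idle:    {race:.3f} J")
```

With these particular numbers the slower strategy wins on energy; with a lower active power at the higher frequency, or a deeper idle state, the outcome flips. That sensitivity is exactly why the trade-off has to be analyzed per workload rather than assumed.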

Energy needs to be architected. “Two different architectures can have the same energy footprint but different power profiles,” says Qazi Ahmed, principal product manager at Siemens EDA. “A high performance, low area architecture will have a power envelope of high sustained power dissipation that can lead to thermal problems downstream. Energy consumption is an essential specification on the datasheet. Any methods to optimize power must lead to improved overall energy efficiency. A theoretical energy draw is possible only if all work done is useful. Practically, one must make sure that the energy consumed is linearly proportional to the amount of work done. An energy proportionality plot over different scenarios, from idle case to peak cases, can reveal areas of low power efficiency that need to be looked at.”
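The energy-proportionality check Ahmed describes can be sketched in a few lines. The utilization points and power figures below are hypothetical; the idea is to compare measured power at each operating point against the ideal line where power is strictly proportional to work done.

```python
# Minimal sketch of an energy-proportionality check. Power above the
# proportional line is overhead that does no useful work. All figures
# are hypothetical.

utilization = [0.0, 0.25, 0.5, 0.75, 1.0]   # fraction of peak workload
measured_w  = [4.0, 6.5, 8.0, 9.0, 10.0]    # measured power at each point

peak_w = measured_w[-1]

# Excess over the ideal zero-to-peak proportional line.
excess_w = [p - u * peak_w for u, p in zip(utilization, measured_w)]

for u, p, e in zip(utilization, measured_w, excess_w):
    print(f"util {u:4.0%}: measured {p:4.1f} W, excess {e:+.1f} W")
```

In this made-up profile the largest excess is at idle and low utilization, which is the typical signature of poor energy proportionality and points to where clock gating, power gating, or retention states would pay off most.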

That requires you to understand more about how a system is going to be used. “Power cannot be defined in a vacuum,” says Synopsys’ Ruby. “You have to say, this is our power spec. This is our power goal in the context of a workload or an application that this device will eventually run. You may have different targets and different workloads, and everyone in the development team has to work towards those.”

It becomes a lot more difficult when the target application is not known. “It feels like we hit the wall every time, but somehow we do manage to push through it — perhaps by using an N-1 node for something like a GPU, where you can get better power efficiency,” says Intel’s Wilcox. “By going wider, we can take advantage of some cost savings from using an older process that has a slightly cheaper wafer and design a bigger GPU. Then we can run it at lower voltage and slower, running a fat, wide machine, rather than a narrow high-voltage, high-frequency machine. There’s no one rule. It’s usually based on where the constraints are right now.”

Sometimes the necessary solutions are drastically different. “In the quest for energy-efficient computing, the spotlight is turning towards analog computing,” says Dave Fick, CEO of Mythic. “Analog computing stands out for its exceptional information density, which drastically reduces the need for transistors and wires. The technology enables the implementation of computational functions with 10 to 100 times fewer components. As a result, it offers a significant reduction in energy consumption, latency, and cost. The key is to identify scenarios where ‘fuzzy’ processing aligns with system performance needs.”

The transfer of data from one form of memory to another performs no useful end function. It is a necessary evil to bring information and compute together. For decades, there has been both a performance and power wall between those two elements, and the industry is now looking much deeper into how this can be minimized, or even better, eliminated.

Reduction of that wasted power is becoming critical. “There are a lot of micro-architectural tricks that have been applied on the compute side to bring the compute implementation power down,” says Sailesh Kottapalli, senior fellow in Intel’s data center unit. “That has made big progress. But if you do a power profile when you’re executing some instructions, where is the energy spent? Data movement, whether it’s from caches or from memory, is the bigger portion of that. The reduction of data transmission energy is the next frontier of power efficiency. One of the big advantages of 2.5D and 3D is trying to cut down that portion of the energy.”
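Kottapalli's power-profile observation is consistent with widely cited order-of-magnitude estimates for per-operation energy in older planar process nodes. The figures below are ballpark values of that kind, not measurements from any specific chip, but they show why data movement dominates the per-instruction energy budget.

```python
# Order-of-magnitude energy-per-operation figures (illustrative ballpark
# values, not vendor data) showing why fetching an operand can cost far
# more than computing with it.

energy_pj = {
    "32-bit integer add":     0.1,    # the "useful work"
    "32-bit register read":   0.3,
    "32-bit SRAM cache read": 5.0,
    "32-bit DRAM read":       640.0,  # off-chip data movement
}

compute = energy_pj["32-bit integer add"]
movement = energy_pj["32-bit DRAM read"]
print(f"one DRAM fetch costs roughly {movement / compute:.0f}x one add")
```

The gap is why 2.5D/3D integration, which shortens the path to memory, attacks the largest slice of the energy profile rather than the compute itself.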

The industry has to be smarter about minimizing data movement. "The energy cost of computation is being outweighed by the energy costs of data movement," says Renxin Xia, vice president of hardware at Untether AI. "This was true before, but with the advent of large language models, it is becoming even greater. From our internal analysis, we're seeing that the ratio between the size of the models and the computation required is increasing, but not to the same scale. The ability to move data around and be energy efficient is becoming even more critical. It has been posited previously that around 90% of the energy is spent moving data rather than on computation. It's only going to get more extreme going forward."

Some of the unnecessary data movement is because of silos. “Traditionally, we’ve had isolation between the GPU and the CPU memory,” says Wilcox. “They didn’t share the same memory locations. That meant you had to copy within the same physical memory system to another area in that same system so the GPU can work on it. That is incredibly wasteful. We are working with Microsoft to implement techniques like shared virtual memory, where we can get past some of that and allow them to hand off a pointer. Then they can operate directly on the memory, as opposed to moving it around. Some of the rigid structures inserted to partition work in the past have an increasing penalty associated with them, and we have to break those things down.”

New memory organization is being considered for AI. “With AI models becoming bigger it’s not possible to fit everything on chip, or locally, so you have to swap things in and out,” says Untether’s Xia. “Then you try to be smarter about your data movement. You optimize the distance of the movements, you try to architect your network, or the network on chip, to move data to its nearest neighbor for the operation of the layer in the neural network, and try to minimize the data movement. Even though there will be some swapping, just reducing the amount of that, and reducing the distance it goes across the chip can help.”
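Xia's nearest-neighbor argument can be captured with a toy mesh network-on-chip model. The tile coordinates and the picojoule-per-bit-per-hop figure below are hypothetical; the point is that transfer energy scales with hop distance, so mapping producer and consumer layers to adjacent tiles directly cuts data-movement energy.

```python
# Toy model of data-movement energy on a 2D mesh NoC: energy scales with
# Manhattan hop distance between tiles. All figures are hypothetical.

PJ_PER_BIT_PER_HOP = 0.1

def transfer_pj(src, dst, bits):
    """Energy (pJ) to move `bits` from tile `src` to tile `dst`."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * bits * PJ_PER_BIT_PER_HOP

bits = 1_000_000  # activations passed between two neural-network layers

far = transfer_pj((0, 0), (3, 3), bits)   # layers mapped to distant tiles
near = transfer_pj((0, 0), (0, 1), bits)  # nearest-neighbor mapping

print(f"far placement:  {far / 1e6:.1f} uJ")
print(f"near placement: {near / 1e6:.1f} uJ")
```

Six hops versus one is a 6x energy difference for the same transfer, which is why layer placement on the chip is treated as an architectural decision rather than a back-end detail.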

Near-memory compute is a technique being used by several companies. "To achieve substantial power savings in the face of diminishing returns from traditional semiconductor scaling, designs will likely need to be re-architected," says Guillaume Boillet, senior director of product management and strategic marketing at Arteris. "This re-architecture can encompass moving away from one-size-fits-all processor designs to systems that incorporate a mix of specialized processing units, and designing those specialized processors and systems with energy efficiency as a primary objective, for instance by adopting near-memory or in-memory computing to reduce the energy cost of data movement."

In some cases, computation can be done in memory, completely eliminating the movement into a processor. "Analog compute-in-memory is one example of how large operations, such as dot products, can be performed efficiently in analog computing," says Mythic's Fick. "There are many other opportunities to add computing to datapaths for denser, faster, and more efficient computation. It will be exciting to see new flavors of analog computing emerge in the coming years."

Back-end reduction
While some say that most of the large optimizations are to be made at the system level, there are many more back-end reductions that are still possible. “The process nodes have been affording us, generally, lower voltage,” says Wilcox. “While we are not getting some of the advantages of scaling we have become accustomed to, power has still been scaling down with every process node. As we’re driving for lower voltages on both Vmin and Vmax, we’ve been getting benefits out of that. We have been able to continue that power performance trend.”

Systems need to continue to scale. “Using two-dimensional silicon, there’s only so much memory and compute you can fit,” says Xia. “Most companies do have a scale-out strategy, such as multiple chips on a board, multiple boards in a system, and multiple systems in a rack. Eventually you should be able to fit all models in a two-dimensional fashion. The other approach is to go vertical. By going vertical, going across different die, we can use different memory technology. We can take advantage of denser memory technologies like DRAM. That will give us at least an order of magnitude denser memory.”

Going to 3D has other advantages. “There has been a recent shift away from monolithic 2D integrated designs to disaggregated designs mapped to multiple dies from heterogeneous manufacturing processes, integrated using advanced 2.5D/3D packaging,” says Vincent Risson, senior principal CPU architect at Arm. “This enables the targeted use of the latest process nodes to the areas where it matters most for energy efficiency. Advanced 3D integration provides the opportunity to change the memory hierarchy by providing larger localized caches, or adopting new disruptive memory technologies whilst still maintaining low access latencies and reducing downstream power consumption. For example, in cloud computing today, many of the challenges we see relate to compute density. Advanced 3D integration not only addresses the reticle limit, but enables parallelism by providing an additional vertical dimension for SoC network architectures.”

Distance is the key. “The interconnect essentially presents a capacitive loading to the die, and that capacitance needs to be charged and discharged as the signals transition,” says Synopsys’ Ruby. “The formula for dynamic power is capacitance times voltage squared times activity. Capacitance is reduced when you make the interconnect shorter, as can happen with 3D integration. You can play with activity and send the data only when you need to. There is also voltage. There is some work in the low voltage differential signaling (LVDS) domain where the signals are not going full swing between chips, but they are more analog-ish in nature. The voltage swing is reduced, and therefore the power consumption associated with charging and discharging that capacitance is also reduced.”
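Ruby's formula can be worked through numerically. The capacitance, voltage, and frequency values below are hypothetical, chosen only to show how shortening the interconnect and reducing the voltage swing each scale the dynamic power.

```python
# Worked example of the dynamic-power formula P = a * C * V^2 * f
# (activity factor a, switched capacitance C, voltage swing V,
# frequency f). All values are illustrative, not from any real interface.

def dynamic_power(activity, cap_farads, volts, freq_hz):
    return activity * cap_farads * volts**2 * freq_hz

base  = dynamic_power(0.2, 2e-12,   0.9, 1e9)  # long 2D interconnect
short = dynamic_power(0.2, 0.5e-12, 0.9, 1e9)  # 4x shorter wire via stacking
lvds  = dynamic_power(0.2, 0.5e-12, 0.3, 1e9)  # plus reduced voltage swing

print(f"baseline:           {base * 1e3:.3f} mW")
print(f"shorter wire:       {short * 1e3:.3f} mW")
print(f"reduced swing too:  {lvds * 1e3:.3f} mW")
```

Quartering the capacitance quarters the power, but because voltage enters squared, dropping the swing from 0.9 V to 0.3 V buys another 9x on top of that, which is why reduced-swing signaling is so attractive for die-to-die links.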

Vertical stacking can provide a significant advantage. “With a 3D vertical stack scenario, we have lots of vertical connections between die, rather than everything going out the side where you’re limited by your perimeter,” says Xia. “We can pack a lot denser vertical interconnects between the dies. And then, because of the proximity, because the dies are stacked right on top of each other, you get better energy efficiency on a picojoule per bit basis, thanks to the laws of physics.”

But it does come at a cost. “Die stacking, depending on what you’re putting where, can be a real challenge,” says Wilcox. “You are putting more impedance between the heat source and the heat sink. Some of the power issues associated with disaggregation help us, where we can take areas where they don’t need to be on a higher performance process. We can move them to an older node and take advantage of that, but we do have to deal with new issues that are created.”

While going to new nodes can help, that creates additional problems. “New nodes may be providing decreased power characteristics in the traditional sense, but they are also adding extra overhead like glitch power,” says Siemens’ Ahmed. “The distribution of net delays versus gate delays at lower technology nodes is leading to unforeseen dynamic power due to glitches. Those could be as high as 40%. Design houses developing computationally intensive logic for AI accelerators need to upgrade their existing power methodologies to make sure they account for glitch power and ways to mitigate it.”

Creativity remains important. “Power delivery is a great example,” says Wilcox. “The more you are able to be resilient to transients, then you don’t have to bake those into your voltage margins. Those are really expensive power-wise. Being able to find ways to accommodate uncertainty around current spikes, which could cause voltage drop below the level of being functional, means you don’t have to keep your voltage higher to accommodate for that. Those types of techniques are really important. They can be as impactful as some of the other big sexy features.”

Techniques such as these only become possible by doing detailed analysis. "You can implement chip-level power techniques like dynamic voltage and frequency scaling, or power shut-off," says Ruby. "All of these things need to be thought through, analyzed, and assessed in terms of what the trade-offs are. If you shut off a block, it is unlikely to wake up in a single clock cycle. You need to give it time to wake up and initialize. In the meantime, the system may be sitting and waiting for that to happen. There is an impact on performance as well."
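The power shut-off trade-off Ruby describes reduces to a break-even calculation: gating a block only saves energy if it stays off long enough to amortize the cost of waking it back up. The leakage, wake energy, and wake time numbers below are hypothetical.

```python
# Sketch of the power-gating break-even analysis. Shutting a block off
# only pays if the idle period exceeds a break-even time, because the
# wake-up itself costs energy. All numbers are hypothetical.

LEAKAGE_W   = 0.050   # power burned if the block stays on while idle
WAKE_ENERGY = 0.002   # joules spent re-powering and re-initializing
WAKE_TIME_S = 0.001   # seconds the system may stall waiting for wake-up

def worth_gating(idle_s):
    """True if powering off for idle_s seconds saves net energy."""
    saved = LEAKAGE_W * idle_s
    return saved > WAKE_ENERGY

break_even_s = WAKE_ENERGY / LEAKAGE_W
print(f"break-even idle time: {break_even_s * 1e3:.0f} ms")
print(worth_gating(0.010))   # short idle period
print(worth_gating(0.100))   # long idle period
```

A real analysis would also charge the wake-up stall (WAKE_TIME_S here) against performance, which is why these decisions are assessed against actual workload idle-time distributions rather than applied blindly.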

While it may be true that the biggest gains are to be made at the system level, there are many optimizations still available at the technology level. There is no tool that will tell you the theoretical minimum power required to execute a particular function, but that doesn’t mean there aren’t plenty of opportunities for improvement. Some of them may require creative thinking to locate them, while other technology advances enable system-level improvements.

Related Reading
Using Real Workloads To Assess Thermal Impacts
A number of methods and tools can be used to determine how a device will react under thermal constraints.
Navigating Heat In Advanced Packaging
New approaches and materials being explored as chip industry pushes into heterogeneous integration.
