11 Ways To Reduce AI Energy Consumption

Pushing AI to the edge requires new architectures, tools, and approaches.


As the machine-learning industry evolves, the focus has expanded from merely solving the problem to solving the problem better.

“Better” often has meant accuracy or speed, but as data-center energy budgets explode and machine learning moves to the edge, energy consumption has taken its place alongside accuracy and speed as a critical issue. There are a number of approaches to neural networks that allow for reductions in energy consumption.

“Everybody’s looking for alternate ways of implementing neural networks that are going to be much lower power,” said Elias Fallon, software engineering group director, custom IC & PCB group at Cadence.

Others agree. “Google itself is worried about the power during the creation and training of the neural net,” said Venki Venkatesh, director of R&D, AI & ML solutions, digital design group at Synopsys.

Every decision made about a neural network has implications for energy. The model itself affects power, as does the choice of implementation hardware. While performance and power historically have opposed each other, many of the lower-energy choices also can increase speed. It’s accuracy that may be more likely to suffer with lower-energy implementations.

We frequently talk about “power consumption” as if power were something that could be consumed. But power is the rate of energy consumption. “Power is super important for a silicon vendor that is making these things. They need to design power rails and voltages correctly,” said Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “But the bigger thing for IP selection is lower energy.”

Said another way, if the limitation is heat, then power is most important. If the limitation is battery life, then it’s energy consumption, not power, that matters.

“Energy saturation turns into power saturation, because you can’t cool the chips once they’re above a few hundred watts,” said Nick Harris, CEO of Lightmatter. “And then that turns into, ‘I can’t use the transistors to do computational work.’”

Less computing and less data mean less energy consumed. This drives a number of architectural approaches and specific tactics that lower energy usage. In addition, less data movement means lower energy consumption. These ideas can operate independently or be stacked together for greater savings.

1. Smaller models. All other things being equal, a smaller model will have fewer layers with fewer filters, and will therefore require less computing and less memory.

“We have seen an order of magnitude difference in compute between graphs that work on the same data sets,” said Fergus Casey, R&D director for ARC processors at Synopsys. “Sometimes the graphs optimized for lower compute requirements don’t even degrade the accuracy of the results.”

Unfortunately, simply choosing a smaller network to train may not be a good solution here. In an area where Occam’s Razor could be applied to great effect, the current tendency toward larger models points to the challenge. If smaller models were always sufficient, there would be no need for big ones. And yet they keep getting bigger.

That’s driven by the need for higher accuracy and more sophisticated decision-making. Doing a better job across a wider variety of training samples can require bigger models. Handling more nuance within a field of samples (distinguishing that cat-looking dog from a dog-looking cat) can also require more heft.

As we’ll see, there are ideas that make a model smaller on a tactical basis, but if MobileNet were sufficient for everything, then we wouldn’t need YOLO. Instead, the smaller models are chosen when the computing platform and energy source so require, with accuracy as a potential tradeoff.

2. Moving less data. This is probably the most visible of energy mitigations, given the widespread acknowledgement of the effects of hitting the memory wall. Solutions here are highly architectural in nature.

While the prior approach suggested reducing data, the memory wall is less about reducing data than about moving it around less and across shorter distances. For a given amount of data, architecture can be a determining factor.

The worst-case scenario would be a single CPU processing an entire network. For each computation, data must be fetched, operated on, and stored away again for later use. Instructions also must be fetched and decoded for a given set of operands. The overhead dominates, hurting both power and performance.

On the other end of the spectrum are architectures where weights stay resident and activations move incrementally forward within a chip. By making the chip large enough to contain an entire model, weights can be loaded once and left in place. By locating small chunks of SRAM at the output of each filter stage, activations can flow to nearby cells for further computing, rather than having to go all the way back into DRAM and be brought out again later.

Keeping weights stationary is only one option, however. “Weight-stationary may be most efficient for traditional convolutions, but non-batched fully-connected layers will be more efficient in input stationary or output stationary mode,” said Casey. The latter refers to generating a single, almost-final result rather than storing and later combining a series of partial results.
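
As a rough illustration of what “stationary” means here, the Python sketch below (a simplified model, not any vendor’s implementation) contrasts two loop orderings for a 1-D convolution. In the first, each weight is held locally while the activations stream past it; in the second, each output accumulator stays resident until it is complete, so every result is written out exactly once.

```python
import numpy as np

def conv1d_weight_stationary(x, w):
    # The small kernel w is loaded once and each tap is held while the whole
    # input streams past it -- the pattern that favors convolutions.
    y = np.zeros(len(x) - len(w) + 1)
    for j in range(len(w)):            # outer loop over the few weights
        wj = w[j]                      # weight held "stationary"
        for i in range(len(y)):        # inner loop streams the activations
            y[i] += wj * x[i + j]      # partial sums accumulate in y
    return y

def conv1d_output_stationary(x, w):
    # Each output accumulator stays resident until it is complete, so there is
    # no partial-sum traffic back to larger memory.
    y = np.empty(len(x) - len(w) + 1)
    for i in range(len(y)):
        acc = 0.0                      # output held "stationary"
        for j in range(len(w)):
            acc += w[j] * x[i + j]
        y[i] = acc
    return y

x, w = np.random.randn(64), np.random.randn(3)
assert np.allclose(conv1d_weight_stationary(x, w), conv1d_output_stationary(x, w))
```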

This is an area where the distinction between power and consumed energy can differ widely. In theory, if computing is all funneled through a single core that consumes x energy per computation, then the energy consumed for y computations will be x times y.

If an array of the same core were built with enough cores to handle all of the computations in parallel, then you would have y cores each doing one computation instead of one core doing y computations. The energy consumed should be the same, but because the parallel version consumes that energy in a much shorter period of time, its power is much higher.
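
A back-of-the-envelope sketch makes the distinction concrete. The per-operation figures below are invented purely for illustration; the point is that the energy is identical in both cases, while the power is not.

```python
# Illustrative numbers only -- not measurements of any real core.
energy_per_op_joules = 1e-9      # x: energy consumed per computation (assumed)
num_ops = 1_000_000              # y: computations to perform
time_per_op_seconds = 1e-6       # time one core needs per computation (assumed)

total_energy = energy_per_op_joules * num_ops    # x times y, in both cases

serial_time = num_ops * time_per_op_seconds      # 1 core doing y computations
parallel_time = time_per_op_seconds              # y cores doing 1 computation each

print(f"energy (either way): {total_energy:.1e} J")
print(f"serial power:        {total_energy / serial_time:.1e} W")    # ~1e-3 W
print(f"parallel power:      {total_energy / parallel_time:.1e} W")  # ~1e+3 W
```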

Another way to boost locality is to do as much computing as close to the data source as possible. Andy Heinig, group leader for advanced system integration and department head of efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division, illustrated this with a sensor-fusion example. “For different tasks, it is very important to split the data processing between different places,” he said. “The first processing step should take place directly on the sensor and further processing on an edge node, which summarizes information from different sensors.”

In an application like an automobile, sensors are widely distributed, so there may need to be some central locus of computation to fold all of the sensors in. “Some decisions can be done only on the centralized processor, because only there is all information from all sensors available,” Heinig added. Minimizing that central computation means less data being fed from the far reaches of the vehicle.

In general, an effective compiler can optimize data movement through careful use of caches, FIFOs, and/or DMA operations.

All other things being equal, this architectural choice should not impact accuracy.

3. Less computing — where possible. For a given model, there may be ways to compute a result with fewer operations. One popular example for convolution is the Winograd approach.

“Using a Winograd transformation of convolutions can significantly reduce the number of computations needed for a large tensor operation, significantly reducing power in the process,” said Steve Roddy, vice president of product marketing, machine-learning group at Arm.

The scale of a straight-ahead approach to convolutional matrix multiplications is on the order of n³, while Winograd reduces that to n². In particular, it reduces the number of “strong” computations (multiplications) at the expense of additional “weak” computations (additions).
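
For the simplest 1-D case, known as F(2,3), the Winograd transform produces two outputs of a 3-tap filter with four multiplications instead of six. The sketch below covers only that minimal case (the 2-D tiled versions used in real convolution layers follow the same pattern), but it shows the multiply-for-add trade directly.

```python
import numpy as np

def direct_f23(d, g):
    # Direct computation of two outputs of a 3-tap filter: 6 multiplications.
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

def winograd_f23(d, g):
    # Winograd F(2,3): the same two outputs with only 4 multiplications,
    # traded for extra additions. The filter-side terms can be precomputed
    # once, since the weights do not change at inference time.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.randn(4)   # a 4-element input tile
g = np.random.randn(3)   # the 3 filter taps
assert np.allclose(direct_f23(d, g), winograd_f23(d, g))
```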

The choice of activation function also can matter. “A small model that relies on tanh as an activation function may be less efficient than a slightly bigger model using simple ReLU,” noted Casey.

The choice of execution model also can matter. Using an interpreted approach like TensorFlow Lite or some other equivalent may consume more energy than a model that’s been compiled directly for the underlying hardware.

4. Batching helps — where possible. For some applications, like processing a collection of static images or frames, greater efficiency can be had by increasing the batch size. That means that a given configuration of the processing resources — kernel code and weights — can be amortized over more samples, resulting in fewer data and instruction fetches.

But this only works for some applications. And it typically applies only to data-center implementations, where an operation is being performed over a large collection of samples that are already resident.

Edge applications normally receive a single data packet at a time. By definition, they must complete the processing of a single packet before they can start on the next one. So this technique doesn’t apply well to some of the implementations that could use it the most — small battery-powered edge devices.

Batching also doesn’t affect accuracy. Results achieved concurrently from a large batch of data should be no different than the same processing done one complete sample at a time.
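
A rough sketch of the amortization, with illustrative shapes rather than any particular accelerator in mind: the batched version makes one pass over the weight matrix to produce all of the results, instead of conceptually re-streaming it for every sample.

```python
import numpy as np

W = np.random.randn(256, 1024).astype(np.float32)        # layer weights
samples = np.random.randn(64, 1024).astype(np.float32)   # a batch of 64 inputs

# One sample at a time: the weights are conceptually refetched per sample.
outputs_single = np.stack([W @ x for x in samples])

# Batched: one pass over W produces all 64 results together.
outputs_batched = samples @ W.T

assert np.allclose(outputs_single, outputs_batched, atol=1e-3)
```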

5. Data formats matter. The highest accuracy will be obtained by using the most precise numbers in the computations. This is why training typically involves floating-point numbers.

But floating-point circuits use more energy than integer circuits. And integers expressed with more bits use more energy than integers with smaller bit widths.

Quantizing to integer is common, although not universal. Where it is done, many companies appear to be settling on 8-bit integers. But researchers are also looking into smaller integers — 4 bits, 2 bits, and even 1 bit. This means less data storage and movement.
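
A minimal post-training quantization sketch, using a single symmetric scale for simplicity (real toolchains typically add per-channel scales, zero points, and calibration data), shows where the 4x storage saving comes from:

```python
import numpy as np

weights = np.random.randn(1024).astype(np.float32)   # 32-bit weights

scale = np.abs(weights).max() / 127.0                # map the observed range onto int8
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale           # what the int8 math approximates

print(f"storage: {weights.nbytes} bytes -> {q.nbytes} bytes")
print(f"max round-off error: {np.abs(weights - dequantized).max():.5f}")
```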

The limit is the single-bit implementation, which gives rise to so-called “binary neural networks,” or BNNs. “Not only does this reduce the amount of compute and the amount of memory, but you can actually get rid of the multiply circuit entirely,” said Linley Gwennap, president and principal analyst at The Linley Group, at the Linley Spring 2021 conference. “When you just have one bit being multiplied by one bit, it’s an XNOR logic gate that performs the function. So this is a great savings in compute and power.”
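
A toy version of that XNOR trick, with weights and activations constrained to ±1 and stored as 0/1 bits (the encoding here is chosen only for illustration): the multiply becomes an XNOR and the accumulation becomes a count.

```python
import numpy as np

w_bits = np.random.randint(0, 2, 64)     # 0 encodes -1, 1 encodes +1
a_bits = np.random.randint(0, 2, 64)

xnor = 1 - (w_bits ^ a_bits)             # 1 wherever the signs agree
dot_from_bits = 2 * xnor.sum() - len(w_bits)

w = 2 * w_bits - 1                       # back to +/-1 for a reference check
a = 2 * a_bits - 1
assert dot_from_bits == (w * a).sum()
```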

The tradeoff here is accuracy. “As you make these tradeoffs, you go from floating point to a quantized version and down to the lower bit precision,” said Mitra. “The fantastic part is that all these things are possible. The part that people forget is that nothing comes for free.”

Models may need retraining after quantization in order to bring the accuracy back up again. Quantization-aware training can help by incorporating the desired bit width into the original training, potentially obviating the need for a downstream retraining step.

Accuracy can also be tuned by allowing different layers to have different precisions. Training frameworks support this, balancing the need for low energy and higher accuracy.

Some companies have spent a lot of effort finding ways to reduce data sizes without harming accuracy. “IBM is able to achieve the same accuracy, from INT8 down to INT4, which is pretty impressive,” said Gwennap. “Even if using 2-bit integers, the accuracy is within a percent or two.”

There is one possible side-effect of reducing data size if accuracy is to be maintained. “Often the size of the nets increases with smaller data types,” said Heinig. “There is definitively a sweet spot. But the sweet spot may be application-specific.”

6. Sparsity can help. Neural-net computations are all about vector and matrix math. In theory, the amount of computing is determined only by the matrix sizes. But 0-valued entries create a zero multiplication result that’s known a priori. There’s no need to run the computation. So the more vector or matrix entries that are 0, the less true computation is needed.

In principle, it’s possible for one raw model naturally to be sparser than some other model. In that case, there’s no tradeoff involved in taking advantage of that sparsity. But usually, rather than matrices or vectors being truly sparse, they may have entries that are very small numbers. The thinking is that those entries have a negligible impact on the overall result, and so the small numbers can be changed to 0 to build in more sparsity.

This can be done during model development by pruning weights that are too small, resulting in sparser weight matrices. It also can be done to activations in real time, depending on the activation functions chosen, leading to sparser activation vectors.
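
A minimal magnitude-pruning sketch, with an arbitrary 70% sparsity target chosen only for illustration (real flows usually fine-tune afterward to recover accuracy):

```python
import numpy as np

weights = np.random.randn(512, 512).astype(np.float32)
target_sparsity = 0.7                                  # illustrative target

threshold = np.quantile(np.abs(weights), target_sparsity)
mask = np.abs(weights) >= threshold                    # keep only the larger weights
pruned = weights * mask

print(f"fraction of zero weights: {np.mean(pruned == 0):.2f}")   # ~0.70
```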

Nvidia has a dynamic version of this involving masking off zero-valued weights. This adds circuit complexity while leveraging sparsity to a greater degree than other approaches might.

Fig. 1: Nvidia’s scheme can leverage sparsity in different positions by clustering the non-zero values and using a mask to ensure that the values are multiplied by the appropriate activation values. Source: The Linley Group

The tradeoff again is accuracy. When a given weight, for example, is pruned away, the assumption is that it wasn’t really going to make a meaningful difference in the result. And that might be true — for that one weight. But if a large number of weights are pruned, then the cumulative result can be lower accuracy.

By the same token, a given activation may not be affected much by introducing sparsity. But if that happens at every layer, then errors can accumulate, throwing accuracy off.

7. Use compression. For a given level of sparsity, compression can help to reduce the amount of data being moved around. At design time, weights can be compressed at the expense of decompression hardware. At run time, inputs or activations can be compressed at the expense of the compression and decompression hardware.

“The energy cost of decompressing is a lot smaller than what you would spend fetching the extra bits,” noted Ashutosh Pandey, senior member of technical staff, systems engineer at Infineon.

This compression, assuming it’s lossless, should have no impact on accuracy.
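
A rough illustration of why lossless compression pairs especially well with sparsity, using zlib purely as a stand-in for whatever (de)compression block the hardware actually provides:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
dense = rng.integers(-127, 128, size=1 << 16, dtype=np.int8)   # random int8 weights
sparse = dense * (rng.random(dense.shape) < 0.3)               # ~70% zeros after pruning

for name, w in [("dense", dense), ("70% sparse", sparse)]:
    raw = w.tobytes()
    packed = zlib.compress(raw)
    print(f"{name:10s}: {len(raw)} -> {len(packed)} bytes")

# Being lossless, decompression recovers the weights bit-exactly.
roundtrip = np.frombuffer(zlib.decompress(zlib.compress(sparse.tobytes())), dtype=np.int8)
assert np.array_equal(roundtrip, sparse)
```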

8. Focusing on events. Video consists of a series of images in sequence. Each frame will differ from the prior one – sometimes greatly, but more often subtly. A single object may move in the frame, for instance, while the background remains stationary.

Traditional approaches re-process that stationary information with every frame. Over time, that’s a lot of redundant processing. An alternative is to focus computing only on what’s changing in the stream. That’s typically far fewer pixels than the full frame, allowing for less computing and less memory.

This can be done using a standard neural network architecture, but by focusing only on change. For the first frame, the entire frame must be processed. For successive frames, a “diff” can be taken between the current and prior frames, directing the work only to those portions that changed.
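
A minimal sketch of that frame-differencing step, with an arbitrary change threshold and frame size chosen only for illustration: only the pixels flagged by the mask would be sent on for further processing.

```python
import numpy as np

def changed_pixels(prev_frame, curr_frame, threshold=8):
    # Compare frames in a wider type so the difference cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    mask = diff > threshold                   # the "events"
    return mask

prev_frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[100:120, 150:180] += 40            # a small object moves

mask = changed_pixels(prev_frame, curr_frame)
print(f"pixels to reprocess: {mask.sum()} of {mask.size}")   # a small fraction
```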

Those changes are categorized as “events,” and this kind of architecture is an “event-based” approach. GrAI Matter is an example of one company that has a more or less conventional network, but that works in the event domain.

Another implementation of this idea is a spiking neural network, or SNN. SNNs are considered to be neuromorphic, and they don’t have the same structure as more conventional artificial neural networks (ANNs). Intel’s Loihi project is a massive attempt at corralling SNNs. BrainChip has a more modest commercial offering.

Fig. 2: Spiking neural networks mimic the brain, lowering power by focusing on events. Source: The Linley Group

It’s not clear that using an event-based approach would have a specific impact on accuracy.

Casey noted one challenge with event-based architectures: Due to a lack of synchrony, “A fully event-driven SNN will not be able to exploit any vectorization.” This can result in less efficient computation — which could give back some of the energy saved.

9. Using analog circuitry. Analog implementations can draw less energy than digital versions – particularly where the computation is done on an analog signal. Converting analog to digital can use a lot of energy, so remaining in the analog domain, if possible, saves that energy. “You can reduce power by 10X, 20X, or more using analog computing,” noted Gwennap.

This can be particularly effective if the incoming sensor stream has a low level of relevant data. Energy is saved by not converting the irrelevant data into digital for analysis. Aspinity is an example of a company following this path.

In-memory computing (IMC) also leverages analog, but it features analog islands in an otherwise digital chip. Digital data must be precisely converted to analog for computation, and then the result has to be precisely converted back to digital, reducing the overall energy savings. But the claim is that the energy savings in the analog domain still make this a net energy win. Mythic is an example of a company using this technique.

IMC can reduce power in two ways. “Not needing to fetch data from memory is an overwhelming factor,” said Ramesh Chettuvetty, senior director for applications engineering and product marketing, memory solutions, for Infineon’s RAM product line. “But the other thing is that IMC massively improves compute parallelism.”
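
As a highly simplified model of what happens inside such an array (ignoring device non-idealities, noise, and the DAC/ADC steps described above), each weight is stored as a conductance, inputs are applied as voltages, and Ohm’s Law plus Kirchhoff’s current law delivers every column’s dot product at once:

```python
import numpy as np

conductances = np.abs(np.random.randn(64, 32)) * 1e-6   # siemens; 64 inputs x 32 columns
voltages = np.random.rand(64) * 0.2                     # volts on the input rows

# Each column's output current is the sum of G * V down that column,
# i.e. one multiply-accumulate per column, all computed in parallel.
column_currents = conductances.T @ voltages             # amps, one per column

print(column_currents.shape)                            # (32,) results at once
```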

Fig. 3: In-memory computing turns multiply-add operations into an Ohm’s Law exercise through a memory array. Weights remain resident as the conductance values in the array. Source: The Linley Group

There is really one main impediment to using analog in more places, as Heinig noted: “Analog techniques are much more complex in implementation.”

10. Using photons instead of electrons. While traditional developers fight with Moore’s Law to bulk up the electronic computing that can be done on a piece of silicon, a very few companies are turning to silicon photonics. This can lower energy consumption because the computing itself consumes no energy.

All of the energy used for photonic computation comes from the laser. That energy will be split and recombined in different ways in the computation phase, and some energy will be lost in the waveguides. But fundamentally, the laser is the only consumer of external energy. All of the rest of the circuits merely dissipate that energy, hopefully leaving enough in the output signal to be of use.

This puts a premium on creating an efficient photonics platform with low-loss waveguides. But there’s a limit to how exotic the platform can be, because today’s practical photonics chips will need to be compatible with CMOS so that fab efficiencies can be leveraged for lower costs (whether or not CMOS circuits co-reside with the photonics on a single chip).

Fig. 4: Photonic “circuits” use no energy for the computing functions. All of the energy is introduced through the lasers. Source: The Linley Group

11. Optimize for your hardware — and software. Tailoring an implementation to the capabilities of the underlying compute hardware also will lower power. This can mean the higher-level architecture or the lower-level details. “If you can tailor your pipeline to your application, that is one way to save power,” said Pandey.

The earlier this is done, the better. “Perhaps the most leverage comes from tailoring the network for the target hardware, and doing so at model development time,” said Roddy. “There are a wealth of techniques available today in the training frameworks that emphasize inference deployment on devices.”

In some cases, you might train the model first, using a rough cut just to get something working as quickly as possible. Once the hardware is decided, retraining for that hardware or compiling specifically for that hardware will result in a more efficient implementation.

While much focus is on hardware architectures, there is almost always some software component executing, as well. How that’s optimized — how loops are nested, for example — can affect energy consumption.
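
As one concrete example of the kind of software-level choice meant here, the sketch below compares a naive matrix-multiply loop nest with a tiled (blocked) version. Both produce the same result; the tiled nest reuses each block while it is cache-resident instead of re-streaming operands from memory. The block size is illustrative, and a real compiler or kernel library would make this choice for the target hardware.

```python
import numpy as np

def matmul_naive(A, B):
    # Straightforward triple loop: operands are streamed with little reuse.
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_tiled(A, B, block=16):
    # Same arithmetic, reordered to work on small blocks that fit in cache.
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, block):
        for k0 in range(0, n, block):
            for j0 in range(0, n, block):
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C

A = np.random.randn(64, 64)
B = np.random.randn(64, 64)
assert np.allclose(matmul_naive(A, B), matmul_tiled(A, B))
```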

Conclusion
With traditional circuits, it’s typically understood that, in order to make things faster, you have to use more energy. Speed and power are endless combatants in this game. But in the case of lowering AI energy consumption, many of the techniques also contribute to higher performance.

For instance, by keeping data local or stationary, you save energy without suffering the added latency of extra data fetches. Sparsity reduces the amount of computation, meaning that you finish faster.

The ultimate example is photonics. Calculations are possible with picosecond-level latency, and wavelength-division multiplexing allows high levels of parallelism, completing multiple streams of computation within that same picosecond-level timeframe.

This is not a free lunch, however. Where power and performance don’t fight, the tradeoffs are likely to be cost and/or accuracy. One may well be trading off silicon area for the power and speed benefits. That said, money spent on silicon may pay for itself with lower energy bills.


