Exponential increase is not sustainable. But where is it all going?
Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable.
To a large extent, this is because the field is new, exciting, and rapidly growing, and its efforts are aimed at breaking new ground in accuracy or capability. Today, that means bigger models and larger training sets, which require exponential increases in processing capability and consume vast amounts of power in data centers for both training and inference. In addition, smart devices are beginning to show up everywhere.
But the collective power numbers are beginning to scare people. At the recent Design Automation Conference, AMD CTO Mark Papermaster put up a slide showing the energy consumption of ML systems (figure 1) compared to the world’s energy production.
Fig. 1: Energy consumption of ML. Source: AMD
Papermaster isn’t alone in sounding the alarm. “We have forgotten that the driver of innovation for the last 100 years has been efficiency,” said Steve Teig, CEO of Perceive. “That is what drove Moore’s Law. We are now in an age of anti-efficiency.”
And Aart de Geus, chairman and CEO of Synopsys, pleaded on behalf of planet Earth to do something about it. “He or she who has the brains to understand should have the heart to help.”
Why is energy consumption going up so fast? “The compute demand of neural networks is insatiable,” says Ian Bratt, fellow and senior director of technology at Arm. “The larger the network, the better the results, and the more problems you can solve. Energy usage is proportional to the size of the network. Therefore, energy efficient inference is absolutely essential to enable the adoption of more and more sophisticated neural networks and enhanced use-cases, such as real-time voice and vision applications.”
Unfortunately, not everyone cares about efficiency. “When you look at what the hyperscaler companies are trying to do, they’re trying to get better and more accurate voice recognition, speech recognition, recommendation engines,” says Tim Vehling, senior vice president for product and business development at Mythic. “It’s a monetary thing. The higher accuracy they can get, the more clients they can service, and they can generate more profitability. You look at data center training and inference of these very large NLP models, that is where a lot of power is consumed. And I don’t know if there’s any real motivation to optimize power in those applications.”
But some people do care. “There is some commercial pressure to reduce the carbon impact of these companies, not direct monetary, but more that the consumer will only accept a carbon-neutral solution,” says Alexander Wakefield, scientist at Synopsys. “This is the pressure from the green energy side, and if one of these vendors were to say they are carbon neutral, more people will be likely to use them.”
But not all energy is being consumed in the cloud. There are a growing number of smart edge devices that are contributing to the problem, as well. “There are billions of devices that make up the IoT, and at some point in the not-too-distant future, they are going to use more power than we generate in the world,” says Marcie Weinstein, director of strategic and technical marketing for Aspinity. “They consume power to collect and transmit, and do whatever it is they need to do with all of this data that they collect.”
Fig. 2: Inefficiency of Edge processing. Source: Aspinity/IHS/SRC
Reducing power
In the past, the tech world relied on semiconductor scaling to make things more energy-efficient. “Our process technology is approaching the limits of physics,” says Michael Frank, fellow and system architect at Arteris IP. “Transistor width is somewhere between 10 and 20 lattice constants of silicon dioxide. We have more wires with stray capacitance, and a lot of energy is lost charging and discharging these wires. We cannot reduce our voltages significantly before we get into a nonlinear region where the outcome of an operation is statistically described rather than deterministic. From the technology side, I’m not really giving us a good chance. Yet there is a proof of concept that consumes about 20 watts and does all of these things, including learning. That’s called the brain.”
So is ML more efficient than the alternative? “The power consumption of ML has to be put in the perspective of its application system, where the tradeoff hinges on the gain in overall performance from the inclusion of ML versus the power profile of the entire system,” says Joe Hupcey, ICVS product manager for Siemens EDA. “And within the many application domains, the industry has developed highly efficient ML FPGAs and ASICs to bring down the power consumption in training as well as inference, and there is much ongoing investment to continue this trend.”
There is one effect that may force more concern about power. “Some companies are looking at power per square micron because of thermal,” says Godwin Maben, Synopsys scientist. “Everybody is worried about heat. When you stack a lot of gates together in a small area, the power density is high, temperature goes up, and you approach thermal runaway. Power density is now limiting performance. As an EDA vendor, we are not just looking at power, because when thermal comes into the picture, performance per watt, and then performance per watt per square micron, become important.”
There are several ways to look at the problem. “I usually like to look at energy per inference, rather than power,” says Russ Klein, HLS platform director for Siemens EDA. “Looking at power can be a bit misleading. For example, typically a CPU consumes less power than a GPU. But GPUs can perform inferencing much faster than a CPU. The result is that if we look at the energy per inference, GPUs can perform an inference using a fraction of the energy that a CPU would need.”
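A minimal sketch of the metric Klein describes is shown below: energy per inference is just average power draw multiplied by time per inference. The power and latency figures are illustrative placeholders, not measurements of any particular device.

```python
# Energy per inference = average power draw (W) x time per inference (s).
# The numbers below are illustrative placeholders, not measurements.

def energy_per_inference(power_watts: float, latency_seconds: float) -> float:
    """Return joules consumed for a single inference."""
    return power_watts * latency_seconds

# Hypothetical example: a CPU drawing 65 W that needs 200 ms per inference
# versus a GPU drawing 250 W that needs 10 ms per inference.
cpu_j = energy_per_inference(65.0, 0.200)   # 13.0 J
gpu_j = energy_per_inference(250.0, 0.010)  #  2.5 J

print(f"CPU: {cpu_j:.1f} J/inference, GPU: {gpu_j:.1f} J/inference")
```

Even though the GPU draws more power, it finishes each inference so much faster that it uses a fraction of the energy, which is Klein’s point about why power alone can mislead.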
Where the most energy is consumed isn’t clear. The answer may seem obvious, but it turns out to be quite contentious. There are two axes to consider: training versus inference, and edge versus cloud.
Training vs. inference
Why does training consume so much power? “A lot of energy is consumed when you’re iterating over the same dataset multiple times,” says Arteris’ Frank. “You are doing gradient-descent types of approximation. The model is basically a hyper-dimensional surface, and you’re computing a gradient, defined by the differential quotient, as you descend through a multi-dimensional vector space.”
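As a rough illustration of that iteration cost, the sketch below runs a toy gradient-descent training loop in plain NumPy. The dataset, model, and loop counts are placeholders chosen only to show the structure: every epoch revisits the entire dataset, and every batch costs a forward computation plus a backward (gradient) computation.

```python
# A minimal sketch of why training is energy-hungry: every epoch revisits the
# full dataset, and every batch costs a forward and a backward pass. The data,
# model, and loop counts are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 32))           # dataset
y = X @ rng.standard_normal(32) + 0.1 * rng.standard_normal(10_000)
w = np.zeros(32)                                # model parameters
lr, epochs, batch = 0.01, 50, 256

for epoch in range(epochs):                     # dataset is traversed 50 times
    for i in range(0, len(X), batch):
        xb, yb = X[i:i + batch], y[i:i + batch]
        err = xb @ w - yb                       # forward pass
        grad = xb.T @ err / len(xb)             # backward pass (gradient)
        w -= lr * grad                          # gradient descent step
```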
The amount of energy consumed by doing that is increasing rapidly. “If you look at the amount of energy taken to train a model two years back, it was in the range of 27 kilowatt-hours for some of the transformer models,” says Synopsys’ Maben. “If you look at the transformers today, it is more than half a million kilowatt-hours. The number of parameters went from maybe 50 million to 200 million. The number of parameters went up four times, but the amount of energy went up more than 18,000X. At the end of the day, what it boils down to is the carbon footprint and how many pounds of CO2 this creates.”
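For a sense of scale, those kilowatt-hour figures can be converted into CO2 with a back-of-envelope calculation like the one below. The grid emission factor is an assumption (roughly 0.4 kg of CO2 per kWh, a commonly cited global average); real footprints depend heavily on the local energy mix.

```python
# Rough CO2 estimate for the training-energy figures quoted above.
# The grid emission factor is an assumption (~0.4 kg CO2 per kWh); actual
# values vary widely with the energy mix powering the data center.

KG_CO2_PER_KWH = 0.4
KG_PER_POUND = 0.4536

def training_footprint_lbs(energy_kwh: float) -> float:
    """Estimated pounds of CO2 for a given training energy budget."""
    return energy_kwh * KG_CO2_PER_KWH / KG_PER_POUND

print(training_footprint_lbs(27))        # ~24 lbs for an early transformer
print(training_footprint_lbs(500_000))   # ~440,000 lbs for a recent large model
```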
How does that compare to inference? “Training involves a forward and backward pass, whereas inference is only the forward pass,” says Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “As a result, the power for inference is always lower. Also, many times during training, batch sizes can be large, whereas in inference the batch size could be smaller.”
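A rough way to see Mitra’s point is to count floating-point operations. The sketch below uses the common rule of thumb that a backward pass costs about twice the forward pass, so one training step is roughly three times the compute of one inference on the same sample. All of the workload numbers are illustrative assumptions.

```python
# Back-of-envelope compute comparison, using the common rule of thumb that a
# backward pass costs roughly 2x the forward pass, so one training step is
# about 3x the FLOPs of one inference. All inputs below are assumptions.

def training_flops(fwd_flops_per_sample, num_samples, epochs):
    return 3 * fwd_flops_per_sample * num_samples * epochs

def inference_flops(fwd_flops_per_sample, num_queries):
    return fwd_flops_per_sample * num_queries

fwd = 1e9          # assumed forward-pass FLOPs per sample
train = training_flops(fwd, num_samples=1e6, epochs=10)   # ~3e16 FLOPs
serve = inference_flops(fwd, num_queries=1e10)            # ~1e19 FLOPs

print(f"training: {train:.1e} FLOPs, lifetime inference: {serve:.1e} FLOPs")
```

Per step, inference is cheaper, but as the hypothetical query count shows, a model served often enough can rack up far more total compute in inference than it ever did in training, which is exactly the point of contention discussed next.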
Where it gets contentious is when you try to estimate the total power consumed by both functions. “There is debate about which consumes more energy, training or inference,” says Maben. “Training a model consumes a huge amount of power, and the number of days it takes to train based on this data is huge. But does it take more energy than the inference? Training is a one-time cost. You spend a lot of time in training. The problem in the training phase is the number of parameters and some models have 150 billion parameters.”
Moreover, training often is done more than once. “Training is not a one and done and never come back,” says Mythic’s Vehling. “They continually retrain, re-optimize models so the training is a constant. They continually tweak the model, find enhancements, a dataset is enhanced, so it’s more or less an ongoing activity.”
However, inference may be replicated many times. “You train a model, which may have been developed for a self-driving car, and that model is now used in every car,” adds Maben. “Now we are talking about inferencing in maybe 100 million cars. One prediction is that 70% to 80% of the energy will be consumed by inference rather than training.”
There is some data to support this. “In a recent paper from Northeastern and MIT, it is estimated that inference has a substantially greater impact on energy consumption than training,” says Philip Lewer, senior director of product at Untether AI. “This is because models are built expressly for the purpose of inference, and thus run substantially more frequently in inference mode than training mode — in essence train once, run everywhere.”
Cloud vs. edge
Moving an application from the cloud to the edge may be done for many different reasons. “The market has seen that there are certain activities that are better pushed to the edge rather than the cloud,” says Paul Karazuba, vice president of marketing for Expedera. “I don’t think there is a clear line of demarcation between what is, and isn’t, going to be done at the edge and how those decisions are going to be made. We are seeing a desire for more AI at the edge, we are seeing a desire for more mission critical applications at the edge rather than AI as a stamp on the outside of the box. The AI is actually doing something useful in the device, rather than just being there.”
It is not as if you take a cloud model and move it to the edge. “Let’s say you have this natural speech, voice recognition application,” says Mythic’s Vehling. “You are training those models in the cloud, and most of the time you’re running those models for inference in the cloud. If you look at the inference applications that are more at the edge, that are not cloud-based, you train the model for those local resources. So it’s almost two different problems that you’re solving. One is cloud-based, the other is edge-based, and they are not necessarily linked.”
Models have to be built knowing where they will ultimately run. “You will usually find the multi-billion-parameter models running in the cloud, but that’s just one category of models,” adds Vehling. “At the other extreme you have really tiny wake-up-word models that take very few resources — call them tinyML, or even below that. And then in the middle is a category of models, such as visual analytics models, that you might see used in camera-based applications. They are much smaller than the models in the cloud, but much larger than these very simple wake-up-word models.”
And it is not just inference that is on the edge. We are likely to see increasing amounts of training. “Federated learning is an example,” says Sharad Chole, chief scientist at Expedera. “One area in which that has been used is auto-complete. Auto-complete for every person can be different, and how do you actually learn that? How do you tailor that? This has to be done while preserving the privacy of the user. There are challenges.”
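For readers unfamiliar with the technique Chole mentions, the sketch below shows the basic federated-averaging loop in plain NumPy: each client trains on its own private data, and only the resulting weights are averaged centrally, so raw user data never leaves the device. Every name, model, and number here is an illustrative assumption, not a description of any production system.

```python
# A minimal federated-averaging sketch: clients train locally on private data
# and only share model weights, which a server averages. Illustrative only.
import numpy as np

def local_update(w_global, X, y, lr=0.01, steps=5):
    """One client's on-device training on its private data."""
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)   # local gradient step
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
w_global = np.zeros(8)
clients = [(rng.standard_normal((100, 8)), rng.standard_normal(100))
           for _ in range(10)]

for round_ in range(20):
    # Each client computes an update locally; only the weights are shared.
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_ws, axis=0)    # server averages the updates
```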
Toward greater efficiency
Moving an application from the training system to the edge involves a significant software stack. “Once you get past the initial training phase, follow-on optimizations deliver significantly lighter weight models with little performance drop,” says Siemens’ Hupcey. “Model simplification techniques are used to reduce power consumption during inference. Quantization, weight pruning, and approximation are widely used after or during model training before its deployment. Two of the most visible cases in point are TinyML and the light versions of GPT-3.”
Adds Klein: “Drop-out and pruning are a good start. Quantizing to smaller numeric representation helps, too. Done aggressively, these can reduce the size of the network by 99% or more, and result in a drop in accuracy of less than 1% in many cases. Some people also look at trading off channels with layers in the model to yield smaller networks without impacting accuracy.”
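As a concrete, hedged illustration of the pruning and quantization steps described above, the PyTorch sketch below prunes 90% of the weights in a toy model and then applies dynamic int8 quantization. The model, the 90% ratio, and the layer choices are placeholders; real deployments tune these against an accuracy budget.

```python
# A minimal PyTorch sketch of post-training pruning and quantization.
# The toy model and the 90% pruning ratio are placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

# 1. Magnitude pruning: zero out the 90% smallest weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # make the pruning permanent

# 2. Dynamic quantization: store Linear weights as int8 instead of float32.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # weights are now ~4x smaller, and 90% of them are zero
```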
These techniques both reduce the model size and directly lower the energy requirements, but more improvements are possible. “Right now we see support for mixed precision, where every layer can be quantized to a different precision,” says Expedera’s Chole. “That could be pushed even further. Maybe every dimension of the weights can be quantized to a different precision in the future. This push is good, because then during training the data scientists become aware of how they can reduce the power, and what accuracy tradeoff they’re making while doing so.”
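A library-agnostic sketch of that per-layer mixed precision is shown below: each layer’s weights are quantized with a simple uniform quantizer at a different bit-width. The bit-width assignments and layer names are illustrative assumptions, not a recommended recipe.

```python
# Per-layer mixed-precision quantization sketch: each layer gets its own
# bit-width. Bit-widths and layer names here are illustrative assumptions.
import numpy as np

def quantize_uniform(weights: np.ndarray, bits: int) -> np.ndarray:
    """Uniformly quantize weights to the given bit-width, then dequantize."""
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    q = np.round((weights - w_min) / scale)
    return q * scale + w_min

# Hypothetical assignment: sensitive layers keep 8 bits, tolerant ones drop lower.
layer_bits = {"conv1": 8, "conv2": 4, "fc": 2}
layers = {name: np.random.randn(64, 64).astype(np.float32) for name in layer_bits}

quantized = {name: quantize_uniform(w, layer_bits[name]) for name, w in layers.items()}
for name, w in quantized.items():
    err = np.abs(w - layers[name]).mean()
    print(f"{name}: {layer_bits[name]}-bit, mean abs error {err:.4f}")
```

Lower bit-widths shrink the weight storage and the energy per memory access, at the cost of larger quantization error, which is the accuracy tradeoff Chole says data scientists need to see during training.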
Conclusion
Models are getting larger in an attempt to gain more accuracy, but that trend must stop because the amount of power that it is consuming is going up disproportionately. While the cloud can afford that today because of its business model, the edge cannot. And as more companies invest in edge applications, we can expect to see a greater regard for energy optimization. Some companies are looking at reductions of 100X in the next 5 years, but that is nowhere near enough to stop this trend.
Related
11 Ways To Reduce AI Energy Consumption
Pushing AI to the edge requires new architectures, tools, and approaches.
Shifting Toward Data-Driven Chip Architectures
Rethinking how to improve performance and lower power in semiconductors.
How To Optimize A Processor
There are at least three architectural layers to processor design, each of which plays a significant role.