Rethinking Machine Learning For Power

Significantly reducing the power consumed by machine learning will take more than optimization. It will take some fundamental rethinking.


The power consumed by machine learning is exploding, and while advances are being made in reducing the power these models consume, model sizes and training sets are growing even faster.

Even with the introduction of fabrication technology advances, specialized architectures, and the application of optimization techniques, the trend is disturbing. Couple that with the explosion in edge devices that are adding increasing amounts of intelligence and it becomes clear that something dramatic has to happen.

The right answer is not to increase the world’s energy production. It is to use what we have more wisely. The industry has to start taking total energy consumed by a machine learning application seriously, and that must include asking the question, ‘Is the result worth the power expenditure?’ This is as much an ethical question as a technical or business one.

The vast majority of computational systems in use today borrow heavily from the von Neumann architecture in that they separate the notions of compute and memory. While this makes for very flexible models of computation, optimized design, and manufacturing, it has a significant downside. Transferring data in and out of the computational core often consumes more power than the computation itself (see figure 1). Add the power consumed by the CPU cache to the I/O power, most of which is communication with memory, then add the power consumed at the other end of that I/O in the memory subsystem and by the memory itself, and the total power attributable to memory swamps that used for computation. While this is for a CPU, the figures are similar for GPUs and ML processors.
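As a rough illustration of why data movement dominates, the following back-of-the-envelope sketch uses per-operation energy estimates of the kind often cited in the computer-architecture literature. The specific picojoule values are assumptions that vary widely with process node and memory configuration, and are not taken from the figure above.

```python
# Back-of-the-envelope energy comparison for one fully connected layer,
# y = W @ x, with a naive "fetch every weight" access pattern.
# The per-operation figures are rough, commonly cited order-of-magnitude
# estimates (assumptions, not measurements of any particular chip).

PJ = 1e-12  # joules per picojoule

E_MAC_FP32 = 4.6 * PJ    # ~fp32 multiply-accumulate
E_SRAM_32B = 5.0 * PJ    # ~32-bit read from a small on-chip SRAM
E_DRAM_32B = 640.0 * PJ  # ~32-bit read from off-chip DRAM

def layer_energy(n_in: int, n_out: int, weights_in_dram: bool):
    macs = n_in * n_out                 # one MAC per weight
    e_compute = macs * E_MAC_FP32
    e_memory = macs * (E_DRAM_32B if weights_in_dram else E_SRAM_32B)
    return e_compute, e_memory

for in_dram in (False, True):
    e_c, e_m = layer_energy(1024, 1024, in_dram)
    where = "DRAM" if in_dram else "on-chip SRAM"
    print(f"weights in {where:12s}: compute {e_c * 1e6:5.1f} uJ, "
          f"memory {e_m * 1e6:6.1f} uJ  ({e_m / e_c:.0f}x compute)")
```

With these assumed figures, keeping weights in on-chip SRAM puts memory energy on par with compute, while fetching every weight from DRAM makes memory energy more than a hundred times larger than the arithmetic itself.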

Fig. 1: Power consumed in a server SoC. Source: AMD/Hot Chips 2019

In the short term, attempts to reduce the power consumption of machine learning systems have focused on reducing the data transfer power, but longer term we need to look at more radical changes, such as merging computation and memory or changing their structure. Perhaps there is more that we can learn from the brain. It only consumes 20W while a comparable machine of today consumes orders of magnitude more than that.

“Neurons in biology store the weights inside the neurons,” says Michael Frank, fellow and system architect at Arteris IP. “When these are turned into virtual neurons for a compute-based neural network, the weights are stored somewhere in memory, and you have to move them. But the movement of data costs you more energy than the compute by itself. It is definitely not a good idea to have a centralized storage for your weights. You probably want to have something where the memory is smeared in between the compute cells.”

There are several ways to look at this problem. “A lot of research is being done in the hardware space,” says Godwin Maben, Synopsys scientist. “How can I bring the memory close to the compute? Different types of memory impart various latency penalties, so how do I merge compute and memory together to reduce this? Do I have to stick to a transistor-based memory? Can we look at different types of stress-based memory, or MEMS? Should I use the typical 6T memory or can I go 1T memory? A lot of things are in motion, but we have not reached any conclusions.”

Today, most of the efforts are related to physically bringing the memory closer to the compute and, where possible, putting enough memory inside the package that the I/O costs are reduced. “By placing the AI inference processing elements next to distributed memory, we’re seeing a 6X power reduction in our processor family compared with von Neumann-architecture processors on the market,” says Philip Lewer, senior director of product at Untether AI. “We believe that the at-memory architecture focused on minimization of data movement, in contrast to von Neumann architectures, has benefits for both training and inference, especially as silicon providers continue to move to smaller process geometries.”

But this alone is not enough. We need orders of magnitude improvement.

Analog versus digital
While we live in an analog world, everything is processed in digital today. “We bring it into the system, we digitize it immediately, and then we do with it what we’re going to do,” says Marcie Weinstein, director of strategic and technical marketing for Aspinity. “That includes any type of intelligence and machine learning and AI and anything else we need to figure out about it. We do that once you’ve already digitized the data. But we live in an analog world, and everything physically happens in analog. What happens if we bring some of that intelligence into analog so that we can decide earlier, before we digitize it, if it is actually worth digitizing and sending to that next stage? In some cases, the entire machine learning workload of a system can be shifted into the analog domain.”

Thinking of analog solutions means fundamentally rethinking architectures. “One of the foundational ideas of analog is you can actually compute in the memory cell itself,” says Tim Vehling, senior vice president for product and business development at Mythic. “You actually eliminate that whole memory movement issue, and therefore power comes down substantially. You have efficient computation and little data movement when analog comes into play. With analog compute in memory technology, it’s actually orders of magnitude more power efficient than the digital equivalent.”

Arteris’ Frank agrees. “The real way to go for neural networks is to go analog, because digital computation relies on your signals having a good signal-to-noise distance from kT, the thermal noise. If you look at where our brain operates, it is much closer to kT. All the voltages in the brain are far below what we use in transistors, and 1/40th of an electron-volt is the equivalent of room temperature.”
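The 1/40 eV figure is easy to check: thermal energy kT at room temperature works out to roughly 26 meV.

```python
# Quick check of the "1/40th of an electron-volt" figure: thermal energy kT
# at room temperature, expressed in electron-volts.
k_B = 1.380649e-23     # Boltzmann constant, J/K
eV  = 1.602176634e-19  # joules per electron-volt
T   = 300.0            # approximate room temperature, K

kT = k_B * T / eV
print(f"kT at {T:.0f} K = {kT * 1000:.1f} meV (about 1/{1 / kT:.0f} eV)")
# -> kT at 300 K = 25.9 meV (about 1/39 eV)
```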

An always-off mentality
The number of smart devices in the home and in industry is increasing rapidly, and many people don’t think about the power consumption of these devices, particularly if they are not powered by a battery. “If you look around the house, look at how many items are actually plugged into the wall in standby mode, all taking 5 or 10 watts,” says Alexander Wakefield, scientist at Synopsys. “There are ways to solve this problem, because it is a very low-tech problem. But it hasn’t been solved because the cost of energy to one person is quite low. I have 20 edge devices in my house that are all waiting for the wake word, consuming power, but individually they don’t use that much. When you have 20 billion or 100 billion of these things worldwide, the total impact is quite large.”

Perhaps there is more inspiration that can be taken from the brain. “There are times where it is beneficial to think about a more cascaded machine learning approach — where there’s something you could know about the data first, before you have to go to a more complex processor,” says Aspinity’s Weinstein. “For example, in a speech recognition system, maybe you want to know there’s a human voice first. The wake word engine doesn’t need to listen for a wake word when the dog barks. It only needs to listen for the wake word when there is actually someone speaking. If the chip can keep everything off and just detect, at very low power, that we’ve got voice, then the next system wakes up to see if they’ve said Siri or Alexa. That is a more efficient way to use the processors that we have within a system.”
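As a rough sketch of the cascade Weinstein describes, the following outlines a two-stage pipeline in which a cheap always-on check gates a much more expensive wake-word model. Both detector functions are hypothetical placeholders for illustration, not any vendor’s API.

```python
import numpy as np

def voice_activity_score(frame: np.ndarray) -> float:
    """Cheap always-on stage: crude RMS-energy stand-in for a low-power
    (possibly analog) voice detector."""
    return float(np.sqrt(np.mean(frame ** 2)))

def wake_word_model(frame: np.ndarray) -> bool:
    """Expensive stage: placeholder for a neural wake-word classifier."""
    return False  # a real model would run inference here

VAD_THRESHOLD = 0.02  # tuning parameter, assumed for illustration

def process_stream(frames):
    for frame in frames:
        # Stage 1: nearly free. Silence and non-speech sounds stop here,
        # so the big model stays asleep most of the time.
        if voice_activity_score(frame) < VAD_THRESHOLD:
            continue
        # Stage 2: only now pay for the full wake-word network.
        if wake_word_model(frame):
            yield frame  # hand off to the assistant / next pipeline stage
```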

This is a different kind of partitioning problem. “Instead of processing using one single neural network on top of sensor data, we need to consider more of a model pipeline,” says Sharad Chole, chief scientist at Expedera. “Multiple neural networks processing a single frame of video, or an image, open up interesting use cases. You might segment the image, you might find interesting regions, you might detect something, you might blur the background. Those use cases are interesting, and they are growing. They are limited by what applications you can deploy, what the constraints are, or what a market segment might push for in terms of innovation. These are fundamentally the reasons why AI at the edge is going to grow. It is the resolution of the sensors and the applications that use model pipelines, combining them into a more coherent or more sophisticated application experience.”
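A minimal sketch of such a model pipeline might look like the following, where each stage is a hypothetical callable (for example, a compiled graph on an edge NPU) rather than a real API.

```python
def run_frame_pipeline(frame, segmenter, region_proposer, detector, blurrer):
    """Several small networks cooperating on one frame instead of one
    monolithic model. Each argument is a hypothetical callable."""
    masks = segmenter(frame)                            # semantic segmentation
    regions = region_proposer(frame, masks)             # pick interesting regions
    detections = [detector(frame, r) for r in regions]  # heavy model on crops only
    composited = blurrer(frame, masks)                  # e.g. background blur
    return detections, composited
```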

And it doesn’t matter if these devices are plugged into the wall or running from battery. “There’s a bigger problem out there than just how many batteries go into a landfill,” says Weinstein. “It’s also about how much energy we are using. Implementing some of these technologies that are improving the efficiency of devices plugged into the wall is going to become very important, because we are going to run out of energy no matter what form it comes in.”

How accurate
Another aspect of a cascaded processing pipeline could relate to accuracy. “How do I use the smallest model possible in the inference engine to make decisions, trading off accuracy against the energy that is needed?” asks Synopsys’ Maben. “In inference, you don’t need 100% accuracy all the time. If I choose to lose 1% accuracy in inference, I can save power. In one application, going from 95% accuracy to 94% has been shown to save more than 70% of the power or energy. Is that 1% loss good or bad? It depends on the application. That 1% accuracy loss is really bad for a self-driving car, but losing 1% when asking Alexa something should be okay.”

Model size and accuracy tend to go hand-in-hand today. “If power is your number one concern, you can choose to run a smaller model, or a less accurate model,” says Mythic’s Vehling. “You could prune models so that they fit into smaller pieces of memory, or you can buy a chip that’s not as powerful, but maybe it gets the results you want. What ends up being interesting is that you always hear about benchmarks, and frames per second, or latency, or whatever, but in the end the question is: does the model solve my problem? Then it comes down to whether it has the frame rate I need and the accuracy I need. Maybe the accuracy doesn’t have to be the top 1%. Maybe it can degrade a little bit so I can run a smaller model, or a lower-power model. You can make a lot of tradeoffs in the edge market with model size, model accuracy, model frame rate, and so on.”
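One concrete way to trade model size for power is magnitude pruning, which zeroes the smallest weights so the model fits in less memory. Below is a minimal sketch using PyTorch’s built-in pruning utilities; the 50% sparsity target is arbitrary and would normally be swept against accuracy on a validation set.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small example network standing in for an edge model.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

# Zero out the 50% of weights with the smallest magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

zeros = sum(int((m.weight == 0).sum()) for m in model.modules()
            if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
print(f"{zeros}/{total} weights pruned to zero")
```

Pruning only translates into energy savings if the runtime or hardware can exploit the resulting sparsity, for example through compressed weight storage or zero-skipping, and the pruned model’s accuracy still has to be validated against the application’s requirements.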

But models are getting larger. “The general trend in neural networks is toward bigger networks with more parameters,” says Russ Klein, HLS platform director for Siemens EDA. “This is exactly the wrong thing to do. Larger networks do marginally improve accuracy, but at diminishing returns. A model may have triple the number of parameters for a fraction of a percent increase in accuracy.”

Data precision has an impact, too. “A 64-bit floating point multiply is one of the most energy-intensive operations you can do in a processor,” adds Klein. “Using integer math, often on fixed-point numeric representations, uses about half the energy of an equivalent floating-point operation. And using smaller number representations has lots of benefits. The energy used in a multiplication is proportional to the square of the size of the operands. Thus, an 8-bit integer multiply will use less than 1% of the energy of a 64-bit floating point multiplication. If weights and features can be reduced to 8-bit or 16-bit fixed-point numbers, not only do the operations consume less energy, but the storage and movement of the data become both more energy-efficient and faster (i.e., more values moved per bus cycle).”
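A minimal sketch of that reduction using PyTorch dynamic quantization, which stores Linear-layer weights as int8 (a quarter the size of fp32) and quantizes activations on the fly. The actual energy savings depend on the target hardware; this only illustrates the mechanics.

```python
import torch
import torch.nn as nn

# An fp32 example model standing in for an inference workload.
model_fp32 = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Dynamic quantization: Linear weights are stored as int8, and the matrix
# multiplies run with integer kernels; activations are quantized at runtime.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print("fp32 output:", model_fp32(x)[0, :3])
print("int8 output:", model_int8(x)[0, :3])  # close, but not bit-identical
```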

How does this translate into analog, which inherently has infinite precision but also variability? “We are dealing with analog signals, which are very accurate,” says Weinstein. “We’re not losing any information when looking at analog signals, like you are when you convert them to digital. But analog does suffer from variability. The way we overcome that is exactly the same way we program the parameters and configure the circuit elements. We are able to trim out the effects of small manufacturing variations. This is well characterized in analog and CMOS manufacturing. So we can program a chip, and we can tune the chip, so that across millions of chips we end up with chips that provide the same answer.”

Forget dataflow
There is at least one other fundamental difference between today’s ML systems and the brain. “The brain uses spiking neural networks,” says Frank. “These have models that are relatively simple, yet they do a lot of high-precision representation of knowledge. They use much less power, and they need a significantly smaller number of coefficients to represent information.”

Spiking neural networks are another kind of off-until-required system. “With a spiking neural network, you only perform work during a very small time window, where you do your processing and then go back to sleep,” says Vehling. “It is a very power-efficient approach. Spiking neural network processors, in conjunction with the classic architecture, could be very powerful. Consider a sensor that wants to wake up when somebody walks by. Then it may try to detect or identify a person. The sensor is coupled to the spiking neural network, which detects the person in the frame. That then cascades and wakes up the next layer of processing.”
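A minimal sketch of the event-driven idea, assuming a simple leaky integrate-and-fire neuron with purely illustrative parameters: downstream work is triggered only on the timesteps where a spike fires, and the expensive path stays idle the rest of the time.

```python
import numpy as np

def lif_spikes(inputs, leak=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron: integrate the input current, fire a
    spike when the membrane potential crosses the threshold, then reset."""
    v, spikes = 0.0, []
    for i in inputs:
        v = leak * v + i
        if v >= threshold:
            spikes.append(True)
            v = 0.0
        else:
            spikes.append(False)
    return np.array(spikes)

rng = np.random.default_rng(0)
current = rng.uniform(0.0, 0.3, size=1000)  # mostly sub-threshold activity
spikes = lif_spikes(current)
# Downstream (expensive) processing would only run on spiking timesteps.
print(f"downstream work triggered on {spikes.mean():.1%} of timesteps")
```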

There are several avenues this could take. “Spiking neural networks are adaptive,” says Frank. “They could continue learning in the same structure. It also shows that you do not have to implement the entire brain to implement a certain set of functions, because we are not using the entire brain, for example, to see and recognize certain patterns. We are using a very small part of the brain for that.”

While much of what has been discussed has concentrated on inference at the edge, there are many lessons that can, and should, be fed back to the model developers. “Knowledge distillation is a promising technique to significantly reduce model size without impacting accuracy,” says Klein. “But the network architectures themselves have to change, with a focus on smaller networks. Reducing the size of the parameter database impacts all the major power consumers, and performance limiters, in inference processing.”
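A minimal sketch of the knowledge-distillation objective Klein mentions, written in PyTorch: a small “student” network is trained against both the hard labels and the softened outputs of a large “teacher.” The temperature and loss weighting are typical but arbitrary choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend of a soft-target loss (match the teacher's softened output
    distribution) and ordinary cross-entropy on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

Trained this way, the student can be much smaller than the teacher while retaining most of its accuracy, shrinking the parameter database that dominates inference power.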

Conclusion
The power consumption of ML devices should scare everyone. This is an ethical responsibility as much as a technical one. To solve some of these problems, the industry has to move outside of its comfort zone and start looking at fundamentally new architectures.

Chip and system design are driven by the excitement of what may be possible, but that has to be tempered by the value that the world gets from some of these technologies, not just how much money can be made from them.
