So far there is little data to show what works best and why.
AI and machine learning have voracious appetites when it comes to power. On the training side, they will fully utilize every available processing element in a highly parallelized array of processors and accelerators. And on the inferencing side, they will continue to optimize algorithms to maximize performance for whatever task a system is designed to do.
But as with cars, mileage varies greatly depending on your driving habits. If you press heavily on the accelerator pedal, fuel goes down faster than if you drive more conservatively. And if you drive at 75 mph, you burn fuel faster than at 55 mph. The same is true for semiconductors.
It’s not exactly clear how this will play out in AI/ML/DL chips, because no one really knows how they will perform over time. The whole idea behind AI is adaptation within an acceptable range of behaviors, but there isn’t much data to support this. The semiconductor industry has barely scratched the surface when it comes to understanding the efficiency of these systems, which may include multiple AI processors.
The problem is that standard rules for utilization of processing, memory and I/O don’t apply when it comes to AI-related designs. At times, when there is little data input, these systems may be virtually idle, but at least some circuitry is always on. The challenge for designers is to figure out what has to be ON, what can be in the OFF state, and what can be in some state of sleep. That can vary by location, application, and whether the processing elements have been optimized for the latest tweaks to an algorithm, which are constantly being updated.
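One way to think about that decision logic is as a policy that weighs input traffic against the latency cost of waking a block back up. The sketch below is purely hypothetical; the states, thresholds, and latency budget are illustrative stand-ins, not any vendor's actual power-management scheme.

```python
from enum import Enum, auto

class PowerState(Enum):
    ON = auto()     # full clocks and voltage
    SLEEP = auto()  # clock-gated, state retained, fast wake-up
    OFF = auto()    # power-gated, state lost, slow wake-up

def select_state(events_per_sec: float, wake_budget_ms: float) -> PowerState:
    """Hypothetical policy: stay ON under load, clock-gate when traffic
    is light, and power-gate only if a cold restart still meets the
    application's wake-up latency budget."""
    if events_per_sec > 1_000:   # illustrative traffic threshold
        return PowerState.ON
    if wake_budget_ms < 5:       # can't afford restoring lost state
        return PowerState.SLEEP
    return PowerState.OFF

# A near-idle block that can tolerate a 50ms wake-up gets power-gated.
print(select_state(events_per_sec=12, wake_budget_ms=50))  # PowerState.OFF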
Power consumption also varies greatly by the types of processing elements used, the datapath between processor and memory, how much data is being generated, and how all of that data is handled by the various processors and accelerators in an AI system. Ironically, software may have the biggest impact on power, yet it has the fewest tools available for predicting system power consumption.
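A back-of-the-envelope calculation shows why the datapath matters so much. The per-operation energy figures below are order-of-magnitude numbers at 45nm from Mark Horowitz's widely cited ISSCC 2014 keynote, used here only to illustrate the gap between compute and data movement.

```python
# Approximate per-operation energy at 45nm (picojoules), per Horowitz,
# ISSCC 2014. Treat these as order-of-magnitude illustrations only.
E_FP32_MULT = 3.7    # 32-bit floating-point multiply
E_SRAM_READ = 5.0    # 32-bit read from a small on-chip SRAM
E_DRAM_READ = 640.0  # 32-bit read from off-chip DRAM

macs = 1e9                          # one billion multiply-accumulates
compute = macs * E_FP32_MULT        # the arithmetic alone
from_dram = macs * 2 * E_DRAM_READ  # both operands fetched off-chip
from_sram = macs * 2 * E_SRAM_READ  # same operands staged on-chip

pj_to_mj = 1e-9  # 1 pJ = 1e-9 mJ
print(f"compute only:       {compute * pj_to_mj:>8.1f} mJ")
print(f"operands from DRAM: {from_dram * pj_to_mj:>8.1f} mJ")
print(f"operands from SRAM: {from_sram * pj_to_mj:>8.1f} mJ")
```

By these rough numbers, fetching every operand from DRAM costs hundreds of times the energy of the arithmetic itself, which is why keeping data close to the cores dominates the hardware side of the conversation.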
While hardware engineers have focused heavily on more efficient processing — whether that involves wide buses to move data back and forth to memory and/or moving the memory closer to the processor cores — the bigger knobs to turn for efficiency are cramming more bits into each compute cycle, pruning the algorithms, reducing precision wherever possible, and getting rid of useless data at or near the source. Software also needs to control how various computations are partitioned across different cores, and how much of the chip is needed to do the job while balancing performance and power.
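Two of those software knobs, pruning and reduced precision, are straightforward to express. The sketch below applies PyTorch's built-in pruning and dynamic-quantization utilities to a toy model; the layer sizes and the 50% sparsity target are arbitrary choices for illustration, and any actual power savings depend on whether the target hardware can exploit sparsity and int8 math.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for part of an inference pipeline.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

# Prune: zero out the 50% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the mask into the weights

# Reduce precision: store Linear weights as int8, dequantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Fewer effective weights and narrower arithmetic mean fewer bits moved
# and switched per inference -- the knobs described above.
print(quantized(torch.randn(1, 256)).shape)  # torch.Size([1, 10])
```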
There are other options, as well, such as spiking neural networks, and structuring data at the source of multiple inputs so it can be processed more easily in a centralized location. But what impact all of these steps have on power is guesswork. Intuitively, it all makes sense from a hardware/software co-design standpoint to throw as many power-saving options at the problem as possible. But at this point, there is scant data to support any of this.
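Even without hard data, a minimal leaky integrate-and-fire model shows why spiking approaches change the power equation: a neuron only produces an output event when its membrane potential crosses threshold, so in an event-driven implementation, dynamic activity (and hence switching power) scales with spike count rather than with raw input rate. The parameters below are arbitrary illustrative values.

```python
import numpy as np

def lif(input_current, v_thresh=1.0, v_reset=0.0, leak=0.9):
    """Leaky integrate-and-fire neuron: the membrane potential decays
    each step, integrates the input, and emits a spike (then resets)
    only when it crosses threshold."""
    v, spikes = 0.0, []
    for i_t in input_current:
        v = leak * v + i_t
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(0)
current = 0.3 * rng.random(1_000)  # weak, noisy input
spikes = lif(current)
# Sparse activity: only a fraction of timesteps produce an event, and
# an event-driven chip only burns dynamic power on those events.
print(f"{spikes.sum()} spikes in {spikes.size} timesteps")
```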
Hi Ed,
I’m glad that at the end you mention spiking neural networks. Brainchip is producing a neuromorphic chip with event-based convolution that runs at extremely low power while maintaining high performance. We do this by eliminating the software component: the network is entirely implemented in hardware, but the dedicated cores are arranged in Python. Once arranged, there is no software running in the network. Cores can be fully connected, separable convolutional, or standard convolutional, with a configurable number of inputs and outputs. We have done ImageNet classification at up to 58 images per second, but also fully spiking odor classification. Power consumption depends only on the number of spikes that go through the chip, and varies from microwatts up to milliwatts. The AKD1000 chip is now in production, and has one-shot or multi-shot learning as well as convolution-based and/or spike-based inference. The chip is fully digital in a 28nm TSMC process, but is process agnostic.