
Thinking About AI Power In Parallel

Next steps in adding efficiency to complex systems.


Most AI chips being developed today run massive numbers of multiply/accumulate (MAC) operations in parallel. More processors and accelerators equate to better performance.
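The MAC operation these chips parallelize is very simple. A minimal Python sketch (illustrative only; a real accelerator runs thousands of these inner loops concurrently in hardware) of the matrix-vector product at the core of most neural-network math:

```python
# Minimal sketch of the multiply/accumulate (MAC) loop at the heart of
# neural-network compute: a matrix-vector product. Illustrative only;
# AI chips run thousands of these MAC units in parallel.

def matvec(weights, inputs):
    """Multiply a weight matrix by an input vector, one MAC at a time."""
    outputs = []
    for row in weights:
        acc = 0.0
        for w, x in zip(row, inputs):
            acc += w * x  # one multiply/accumulate (MAC) operation
        outputs.append(acc)
    return outputs

print(matvec([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0]))  # [17.0, 39.0]
```

Every additional MAC unit that fits on the die is another iteration of this loop that can run at the same time, which is why compute-element count dominates the node-shrink calculus.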

This is why it’s not uncommon to see chipmakers stitching together multiple die that are larger than a single reticle. It’s also one of the reasons so much attention is being paid to moving to the next process node. It’s not necessarily for the performance improvements per transistor. It’s all about the number of processing elements that can fit on a die. The smaller the feature sizes, the greater the number of compute elements.

Until now, most of the concern has been about massive improvements in performance, sometimes as high as 1,000X over previous designs. Those numbers drop significantly as the chips become less specialized, and specialization has always been a factor in performance, going back to the days of mainframes and the Windows-Intel (Wintel) duopoly in the early days of PCs and commodity servers. Performance is still the key metric in this world, and there is so much dissipated energy in the form of heat that liquid cooling is almost a given.

But power is still a concern, and it’s becoming a bigger concern now that these AI/ML chips have been shown to work. While the initial implementations of training were done on arrays of inexpensive GPUs, those are being replaced by more customized hardware that can do the MAC calculations using less power. All of the big cloud providers are developing their own SoCs for training (as well as some inferencing), and performance per watt is a consideration in all of those systems. With that much compute power, saving power can add up to massive amounts of money, both for powering and cooling the hardware and for preserving the lifespan and functionality of these exascale compute farms.
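A back-of-the-envelope calculation shows how quickly performance per watt compounds at scale. All of the numbers below are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope estimate of what an efficiency gain is worth at
# data-center scale. Every number here is an illustrative assumption.

racks = 1000                 # assumed number of racks
kw_per_rack = 30.0           # assumed power draw per rack, in kW
cooling_overhead = 0.4       # assumed extra energy spent on cooling
price_per_kwh = 0.10         # assumed electricity price, USD per kWh
hours_per_year = 24 * 365

total_kw = racks * kw_per_rack * (1 + cooling_overhead)
annual_cost = total_kw * hours_per_year * price_per_kwh

savings_10pct = annual_cost * 0.10  # value of a 10% efficiency gain
print(f"Annual energy bill: ${annual_cost:,.0f}")
print(f"Saved by a 10% efficiency gain: ${savings_10pct:,.0f}")
```

Under these assumptions a 10% efficiency improvement is worth millions of dollars per year in electricity alone, before counting the longer hardware lifespan that running cooler buys.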

On the edge, where the majority of the inferencing will be done, the benefits of power efficiency are just as tangible. If an electric vehicle wastes energy on computation, mileage per charge goes down. It’s the same for a mobile device or anything with a battery, and in enterprise data centers (on-premises or off-premises), the amount of energy spent per transaction can be a major cost for small and midsized businesses.

One of the initial problems in all of these systems is that they were designed around always-on circuits. They assumed the amount of data to be processed would be enormous, which in many cases was a good assumption. The problem is that not all of that data needs to be processed all the time, and not all of it is needed immediately. So while massive increases in performance are possible in many of these chips, that peak performance is not always necessary. And from a power-efficiency standpoint, running at full throttle anyway is wasteful.

The challenge now is how to begin making huge gains in efficiency without losing that performance, and that leads to a couple of changes. First, data and processing need to be partitioned in a more intelligent manner, so that compute functions can ramp up and scale down quickly. This is done all the time in phones, where the screen dims when it is next to your face and lights up when it is not, but it’s more difficult when that includes thousands of processing elements fine-tuned to parse computations based on very specific algorithms.
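One way to make compute functions "ramp up and scale down quickly" is to gate groups of processing elements behind a work-aware wake-up policy. The sketch below is a toy model of that idea; the `PowerGate` class name and threshold are hypothetical, not any vendor's API:

```python
# Toy model of work-aware power gating: compute elements wake only when
# enough work is queued, and power back down as soon as the queue drains.
# Class name and threshold are hypothetical, for illustration only.

class PowerGate:
    def __init__(self, wake_threshold=4):
        self.wake_threshold = wake_threshold  # queued items needed to wake
        self.queue = []
        self.awake = False

    def submit(self, item):
        self.queue.append(item)
        if len(self.queue) >= self.wake_threshold:
            self.awake = True  # ramp the compute elements up

    def drain(self):
        """Process everything queued, then power back down."""
        processed, self.queue = self.queue, []
        self.awake = False  # scale back down once the work is done
        return processed

gate = PowerGate(wake_threshold=2)
gate.submit("frame-1")
print(gate.awake)    # False: not enough work yet to justify waking up
gate.submit("frame-2")
print(gate.awake)    # True: threshold reached, ramp up
print(gate.drain())  # ['frame-1', 'frame-2']
print(gate.awake)    # False: back to sleep
```

The hard part in silicon is doing this across thousands of processing elements without blowing the latency budget, which is why it is so much harder than dimming a phone screen.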

The second challenge is to understand what data really needs to be processed and where, and this raises power engineering to the system level. It’s certainly not necessary to process all streaming video all the time. It’s not even necessary to move most of that data anywhere but into the trash. But that means end points need to be smarter, and that requires new algorithms and much more hardware-software co-design at the end point, where so far there has been comparatively little progress.
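A concrete version of a "smarter end point" for streaming video is simple frame differencing: forward only frames that changed enough to be worth processing, and drop the rest at the edge. A minimal sketch, with an assumed threshold and frames modeled as flat pixel lists:

```python
# Sketch of end-point filtering for streaming video: only frames that
# differ meaningfully from the previous frame are forwarded for
# processing; the rest are discarded locally. Threshold is assumed.

def select_frames(frames, threshold=10.0):
    """Keep only frames whose mean absolute change exceeds the threshold."""
    kept, prev = [], None
    for frame in frames:  # each frame: a flat list of pixel values
        if prev is None:
            kept.append(frame)  # always keep the first frame
        else:
            diff = sum(abs(a - b) for a, b in zip(frame, prev)) / len(frame)
            if diff > threshold:
                kept.append(frame)  # enough change: worth processing
        prev = frame
    return kept

static = [0, 0, 0, 0]
moved = [50, 50, 50, 50]
print(len(select_frames([static, static, moved, moved])))  # 2
```

Even this crude gate discards half the frames in the example, and that filtering decision, made in hardware and software together at the end point, is where the co-design work still lags.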


