Designing Ultra Low Power AI Processors

Approaches and techniques are changing as more processing moves to devices powered by a battery.


AI chip design is beginning to shift direction as more computing moves to the edge, adding a level of sophistication and functionality that typically was relegated to the cloud, but in a power envelope compatible with a battery.

These changes leverage many existing tools, techniques and best practices for chip design. But they also are beginning to incorporate a variety of new approaches that emphasize more interplay between hardware and software, as well as new research about how to improve the efficiency of those computations. Projections call for these new architectures and methodologies to complete many more operations per second and per watt than today's designs can achieve.

“The first goal for a new AI chip design, as with any new design, is functionality, where power is restricted to a metric that is measured only once the gate netlist is ready,” said Preeti Gupta, head of PowerArtist product management at Ansys. “Chip architects are quick to realize that high power consumption has a dire impact on the functionality and profitability. Also, the power distribution network may not be able to sustain the hundreds of watts of power consumption, so the package may need extra metal layers leading to a significant increase in cost, while the thermal impact can span the chip, the package and the system.”

Many of these devices are now required to do more computing using the same or a smaller battery.

“If you look at the early consumer AI types of things, you think about Alexa or Siri, where the processing is all done in the cloud with very little local processing,” said Rich Collins, product marketing manager for IP subsystems at Synopsys. “Now, there’s a need to push AI down to the actual edge device, whether it’s your Fitbit, smartwatch, or even cell phone. Everything requires more local processing — or at least as much of the local processing that can be done locally — which means the power consumption issue is now relevant. If it’s going into a data center, it’s still important, but it’s less important. If you’re talking about your watch that only holds a charge for 24 hours, you can’t have a high-power processor.”

This isn’t just confined to wearable electronics, either. The same kinds of issues are surfacing in increasingly autonomous automotive systems, and in surveillance cameras that include voice recognition, gesture recognition or face detection. “More and more, these are being addressed by neural network engines, and engineering teams want to see this idea that they can build a frame on a network engine to process what’s required for the applications,” Collins said.

All of this needs to be baked into the design flow as early as possible, using power-smart methodologies that provide visibility into power consumption and profiles. Early analysis at RTL and above, for example, allows power-related design decisions that have a high impact. Later in the design flow when a netlist is available, however, power-related changes are locked within the high-level architecture and can only provide incremental benefits, Gupta said.

What’s different
Not all of this is straightforward, however. AI adds a new set of challenges. Rapidly changing algorithms mean that design engineers must analyze power weekly and often daily. In addition, many of these designs are physically large — sometimes exceeding the size of a reticle. As a result, tools need to apply big data and elastic compute capabilities to read the full design. Some of that can be done on commodity machines, but not all.

“One would think that for a design with repeated structures, you could optimize one of the structures and replicate,” Gupta said. “That is true. However, some of the most severe power ‘bugs’ can occur in block-to-block interactions that are only visible in the context of the full design. Turning off clocking for an entire block can have a much larger impact on power than clock gating a single register bank.”
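Gupta's comparison can be made concrete with the standard switching-power relation, P = activity x C x V^2 x f. The sketch below is only a back-of-the-envelope illustration. The capacitances, voltage, frequency and idle fraction are invented assumptions, not figures from Ansys or any real design.

```python
# Illustrative only: every number here is a hypothetical assumption.

def dynamic_power_mw(activity, capacitance_pf, voltage_v, freq_mhz):
    """Switching power P = activity * C * V^2 * f, returned in milliwatts."""
    # pF * V^2 * MHz yields microwatts; divide by 1000 to get milliwatts.
    return activity * capacitance_pf * voltage_v ** 2 * freq_mhz / 1000.0

VOLTAGE_V = 0.8        # assumed supply voltage
FREQ_MHZ = 500.0       # assumed clock frequency
IDLE_FRACTION = 0.5    # assumed share of cycles in which the logic is idle

# Assumed clock load of one register bank versus a whole block's clock tree.
bank_clock_mw = dynamic_power_mw(1.0, 5.0, VOLTAGE_V, FREQ_MHZ)
block_clock_mw = dynamic_power_mw(1.0, 200.0, VOLTAGE_V, FREQ_MHZ)

# Gating removes the clock power for the idle share of cycles.
bank_gating_savings_mw = bank_clock_mw * IDLE_FRACTION
block_gating_savings_mw = block_clock_mw * IDLE_FRACTION

print(f"register-bank gating saves {bank_gating_savings_mw:.1f} mW")
print(f"block-level gating saves   {block_gating_savings_mw:.1f} mW")
```

With these made-up numbers the block-level saving is 40 times larger, simply because the block's clock tree carries 40 times the switched capacitance. The opportunity to gate the whole block, however, depends on knowing when neighboring blocks leave it idle, which is why it only shows up in full-design analysis.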

AI inferencing also can be prone to glitch power due to extensive arithmetic logic. RTL-based approaches have emerged that can highlight glitch-prone areas early in the design flow. Further, when using RTL activity data with a gate netlist, it is important to have a power analysis framework that can propagate events to compute glitch power instead of statistical activity propagation approaches that are prone to error. Event propagation is expensive, and using massively parallel approaches in big data systems is required, Gupta said.
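A toy example shows why event propagation catches glitches that a zero-delay view misses. The sketch below is a deliberately simplified model of a single XOR gate, not how any commercial power tool works. The function names and arrival times are hypothetical.

```python
def count_output_transitions(input_events, initial_inputs, gate):
    """Count output toggles of one gate, given timed input changes.

    input_events is a list of (time, input_name, new_value) tuples.
    """
    values = dict(initial_inputs)
    output = gate(values)
    transitions = 0
    # Apply all changes that share a timestamp before re-evaluating the gate.
    for time in sorted({t for t, _, _ in input_events}):
        for t, name, new_value in input_events:
            if t == time:
                values[name] = new_value
        new_output = gate(values)
        if new_output != output:
            transitions += 1
            output = new_output
    return transitions

def xor_gate(values):
    return values["a"] ^ values["b"]

initial = {"a": 0, "b": 0}

# Both inputs rise at the same instant: the output stays at 0.
aligned = [(0, "a", 1), (0, "b", 1)]
# Input b arrives two time units late: the output pulses 0 -> 1 -> 0.
skewed = [(0, "a", 1), (2, "b", 1)]

aligned_toggles = count_output_transitions(aligned, initial, xor_gate)
skewed_toggles = count_output_transitions(skewed, initial, xor_gate)
print(aligned_toggles, skewed_toggles)  # prints "0 2"
```

The start and end states are identical in both cases, so an analysis that only compares settled values sees no activity. The skewed case still burns power on two extra transitions, and arithmetic-heavy logic such as multiply-accumulate arrays contains many paths with unequal delays.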

While there isn’t one trick to achieving low power in a design, there are a couple of key points that are specific to machine learning.

“In many ways, ML processing is more a data management problem than a ‘processing’ problem,” said Dennis Laudick, vice president of marketing, machine learning group at Arm. “The data involved in ML processing can be very large, and if you don’t focus on efficient data movement, you could easily waste 100X more power moving data around and then processing it. Techniques such as tailored compression, quantization, pruning, clustering, minimizing processor data reloads, etc., are absolutely critical.”
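Two of the techniques Laudick lists, pruning and quantization, are easy to illustrate in a few lines. The sketch below is a minimal, generic version (magnitude pruning followed by symmetric int8 quantization) and is not Arm's implementation. The weight values, the pruning threshold and the one-byte-per-index sparse format are all assumptions chosen for illustration.

```python
def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is below threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def quantize_int8(weights):
    """Symmetric linear quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dense_bytes(count, bytes_per_value):
    """Storage for a dense array."""
    return count * bytes_per_value

def sparse_bytes(values, bytes_per_value, bytes_per_index):
    """Storage when only nonzero values and their indices are kept (assumed format)."""
    nonzero = sum(1 for v in values if v != 0)
    return nonzero * (bytes_per_value + bytes_per_index)

# Hypothetical weights for a tiny layer.
weights = [0.40, -0.02, 0.01, -1.00, 0.03, 0.25, -0.01, 0.75]

fp32_bytes = dense_bytes(len(weights), 4)
pruned = prune(weights, threshold=0.05)
quantized, scale = quantize_int8(pruned)
int8_sparse_bytes = sparse_bytes(quantized, bytes_per_value=1, bytes_per_index=1)

print(quantized)                      # [51, 0, 0, -127, 0, 32, 0, 95]
print(fp32_bytes, int8_sparse_bytes)  # 32 8
```

In this toy case the data that must be fetched shrinks from 32 bytes to 8. Whether the accuracy loss is acceptable depends on the network and has to be checked per model, which is one reason these techniques are tailored rather than applied blindly.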

In addition, ML is still a very dynamic area of research and there’s a challenge around how much processing to implement in highly fixed and highly efficient hardware, versus flexible, future-proof but less efficient general-purpose processing, he said.

Fig. 1: Arm’s low power AI chipset. Source: Arm

“What makes power such a challenge to get right in an application like a doorbell camera is if you look at the power envelope of the system, it’s looking at the images and identifying the patterns of the images, creating the network,” said Anoop Saha, market development manager at Mentor, a Siemens Business. “It’s a lot of computation, and a lot of memory accesses. What happens in the market today is that the design team will take a card from Renesas or NXP or Arm or another provider. There is a simple card where there’s a CPU and some other things like a generic CPU, a generic compute element, and a generic memory element. There are some specific things built around AI, but most of the cards that exist in the market are general-purpose. That’s the only way those companies can make profit. These cards have to be general-purpose so they can apply to a wide variety of cases. However, because it’s general-purpose, it’s not specialized to a use case that a specific company is working on. That’s one of the reasons that the way the memory is organized, and the way the application accesses the memory, the power envelope is like 30 or 40 times more than what they would like it to be. It’s not like it would be if it were going into a data center, but it needs a constant power connection. Otherwise, the battery will die very soon. They are using a general-purpose chip for doing a very specific AI application, but it’s a different way of computing and different way of doing things, so that shoots up the power envelope of that chip. The more you specialize, the better it is.”

The traditional options for specialization involve ASICs and FPGAs, but both of those have constraints that can make them unsuitable for an ultra-low-power AI processor.

“FPGAs are relatively slower than many options,” Saha said. “They can have a very high power envelope, so you may save some money, but then you don’t. You save some power, but not enough to justify. There is also a cost. If an IoT device is being done in volume, then FPGAs are not the right solution. The problem with using an ASIC is that you want it to be specialized for a particular use case. This is an added cost factor. If it is not at volume, there is a tradeoff. The company has to decide whether they invest in that chip development, or whether they should buy something that’s general-purpose from the market. If they do an ASIC, they should be able to adapt it, add new network layers, change the weights later on. Also, it should be fairly robust. They need to determine if the ASIC they are designing is the best ASIC they can do for these types of applications. That requires a lot of verification and early power computation. It needs paradigm of everything, including how you design a chip, how you verify it, whether you have the right metrics before you send the chip for tape-out, and you have to be able to do it at volume without a huge budget. All three of those things need to align.”

What’s new in AI processors
Many of the existing AI processors aren’t all that much different from other types of processors. But that’s starting to change, fueled by a tremendous amount of research across multiple market segments and the widespread adoption of AI/ML/DL for a number of new and often unique applications.

“People are coming out with a lot of innovative AI models, so any IP vendor needs to be able to take those different models or neural networks that people generate and give them an ability to quickly run that on the hardware,” said Pulin Desai, group director, product marketing and management for vision and AI DSPs at Cadence. “But design teams need some flexibility for the future so that if something new gets invented they can bring that up quickly.”

What’s also common to AI processors are the power management techniques being employed today, said Synopsys’ Collins. “You’re looking at things like clock gating and power islands, since power management is critical. If you look at the programming paradigms on embedded vision processors, for instance, there are certain parts of the programming on the RISC core. That’s part of the vision engine. Then, these neural networks can be pretty big, depending on the number of multiply-accumulate units that are in it. But in general, power gating is still a huge part of any architecture.”

Nevertheless, he points to a subtle shift where the emphasis is on understanding the algorithms, then determining what can be gated off based upon that understanding, regardless of whether that includes power or clocking. “The concepts are the same. It’s just a matter of trying to understand the application space and where it makes sense to be able to leverage that,” Collins said. “If you’re looking at embedded vision for ADAS in the car, it’s a pretty high-performance piece that can consume a lot of power. Anything you can do to tweak that is going to be helpful.”

Other changes are not so subtle. “With a lot of new devices, what we’re seeing is a focus on how much power they are using and what new topologies can be used to make them more efficient,” said Andreas Brüning, head of the department for efficient electronics at Fraunhofer IIS’s Engineering of Adaptive Systems Division. “There are new technologies and research into what is the right material for a device. We’re also seeing research around the conversion from analog to digital to gain more flexibility around processing of data. In the IoT area, there is more and more interfacing of more than one sensor to the data. We’ve been working on a version of a random number generator with more efficient power consumption.”

However, not all challenges to achieve the best architecture for an ultra-low power AI processor are unique to AI, Collins said. “How much functionality do you want within a certain power envelope? You want your smartwatch to do all these things, but then you complain that the smartwatch only has two hours of battery. It’s that tradeoff of wanting it to do everything, but within a reasonable power envelope — especially for battery operated devices.”

What is unique is that design teams are much more open to options that were largely ignored in the past in order to squeeze every bit of power and performance out of a design. The need to reduce margin at advanced nodes has been talked about for years, but the ability to actually do something about it was dispersed across different parts of the design-through-manufacturing chain. So while the techniques and technologies may be familiar, the precision with which they are used is only starting to be well understood.

“The key is where you are going to monitor these conditions in these devices, because when you try to measure something you may be displacing some of that inherent information,” said Stephen Crosher, CEO of Moortec. “The conditions in an AI device are very different. It may consist of hundreds of cores, or even hundreds of thousands of cores. But the workloads are very bursty as they run algorithms, and what we see is that you often cannot deliver enough power to have all cores running at once. So you never actually achieve full utilization. The solution is to make the most of the power that you can deliver to the device.”
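The utilization ceiling Crosher describes follows from simple arithmetic. The sketch below assumes a fixed deliverable power budget and fixed per-core idle and active power. All of the figures are hypothetical and are given in milliwatts so the math stays in integers.

```python
def max_active_cores(deliverable_mw, idle_mw_per_core, active_mw_per_core, total_cores):
    """Largest number of cores that can be active at once within the power budget."""
    baseline_mw = idle_mw_per_core * total_cores      # every core draws at least idle power
    headroom_mw = deliverable_mw - baseline_mw
    if headroom_mw <= 0:
        return 0
    extra_mw_per_core = active_mw_per_core - idle_mw_per_core
    return min(total_cores, headroom_mw // extra_mw_per_core)

# Hypothetical device: 400 cores, 20 W deliverable, 10 mW idle and 100 mW active per core.
TOTAL_CORES = 400
active_limit = max_active_cores(
    deliverable_mw=20_000,
    idle_mw_per_core=10,
    active_mw_per_core=100,
    total_cores=TOTAL_CORES,
)
utilization_ceiling = active_limit / TOTAL_CORES

print(active_limit)                  # 177
print(f"{utilization_ceiling:.1%}")  # 44.2%
```

Under these assumptions fewer than half the cores can ever run together, so scheduling bursty workloads against accurately monitored on-chip conditions matters more than adding cores.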

The other side
AI is quickly moving from buzzword to reality. “Our challenge is the idea that customers across the board now want to get in on this sort of thing, whether they’re doing basic IoT devices or cloud-based things. We’re constantly trying to create a unique look and feel so that everything is easy for our customers because they may have a project that’s at the low end, and then maybe in the future they’ll have another project at the high end. They don’t have to re-learn a whole new architecture and a whole new programming paradigm. They want to leverage a similar look and feel, and that’s what we’re spending quite a bit of time and effort on right now,” Collins said.

That approach appears to be gaining steam among fabless AI chipmakers. Steve Teig, CEO of Perceive, a startup that is developing an advanced neural network that runs at 20 milliwatts, said the most interesting, complicated and exciting part is the machine learning technology itself.

“While I’m proud of the hardware, the hardware is really an embodiment of a couple of years of really novel research and novel rigorous mathematics to figure out how should we think about that computation differently,” Teig said. “If you’ve looked at the jillion other companies doing inference with neural networks, both big and small, everybody has a large array of MACs. We don’t. We believe it is a red herring that you need an array of MACs to do neural network processing. If you think about the computation the neural network is doing, you don’t have to write down that computation in the obvious way. It’s being willing to take that plunge and working out all the math to make it possible.”

Big systems companies are developing their own custom hardware, as well. “A number of companies are now trying it out, and it’s not just the traditional chip companies or traditional OEMs,” said Mentor’s Saha. “The companies doing AI chips now, like Alibaba, Google and Facebook, are designing their own chips. This has a domino effect on a lot of things. What are the nodes? Will it all be TSMC 7nm little chips, or is there a higher node that can be used for these? In some examples, even a GlobalFoundries 22nm chip can save at least 10X power compared to a general-purpose compute at a lower node. Developers can choose which node is best based on their specific factors.”

He noted that nearly all universities are researching these issues today. In fact, it is the most dynamic period for chip development in the last couple of decades.

“The single most important dynamic happening, however, is the coming together of different domains,” he said. “The hardware guys, software guys, architecture guys — they all work together as part of one team reporting to one manager. That didn’t happen before. A software engineer would write the algorithm. The architect would take the algorithm and create the architecture, and pass it to the design team. Then the verification team was in a completely different domain. There is a very interesting organizational evolution happening, where a small team of 10, 15, 20 members will do a chip in eight months. All of us will work together — the design team, the verification team, the architecture team and the software team will report to the same hierarchy, have regular meetings and decide things together.”

A different path forward
The marriage of multiple disciplines in the same design has been discussed for decades, but to be competitive in the low-power AI space it is getting a lot of top-down buy-in for the first time.

“It means that you need folks within the team who deeply understand software development and hardware development and machine learning,” said Perceive’s Teig. “Those are all very specialized disciplines. It’s rare to find people who are expert at all of them, and it’s rare to find a team that’s so intimately connected all those pieces. We deliberately built a team with that exact orientation in mind, people from my past, some people from [parent company] Xperi and others, so we have a bunch of hardware experts and software and ML experts, and enough people with one foot in each of more than one of those disciplines to be able to build that ensemble together. Otherwise, you end up with software people designing hardware, and that doesn’t go well, or hardware people designing software, and that gets even worse.”

—Ed Sperling contributed to this report.
