Optimizing Power For Learning At The Edge

Making learning on the edge work requires a significant reduction in power, which means rethinking all parts of the flow.


Learning on the edge is seen as one of the Holy Grails of machine learning, but today even the cloud is struggling to get computation done using reasonable amounts of power. Power is the great enabler—or limiter—of the technology, and the industry is beginning to respond.

“Power is like an inverse pyramid problem,” says Johannes Stahl, senior director of product marketing at Synopsys. “The biggest gain is at the top, with the algorithm and architecture. If you do not start to consider it there, then you have no chance of solving it by switching a library element at the bottom.”

With an application like machine learning, there is also an interesting middle ground. “Artificial intelligence (AI) chips instantiate logic multiple times,” says Preeti Gupta, director for RTL product management at ANSYS. “We have seen chips that had around a thousand cores. That is a lot of repeated logic. While for a high-end CPU there may be a power reduction that leads to a small amount of savings, the designer may decide it is not worth their time. But with AI, they do go an extra step in addressing these power challenges because of the massive amounts of replication. So if you save a small amount of power for one block, that turns into a lot of power when you instantiate it so many times.”
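
The replication argument is simple arithmetic, sketched below. The function name and the numbers (2 mW per core, 1,000 cores) are hypothetical and chosen only for illustration; they do not come from any chip mentioned here.

```python
def replicated_saving_watts(per_block_saving_mw: float, instances: int) -> float:
    """Chip-level saving in watts when a per-block saving is instantiated many times."""
    return per_block_saving_mw * instances / 1000.0


# Hypothetical numbers: 2 mW saved per core, 1,000 identical cores.
total = replicated_saving_watts(2.0, 1000)
print(f"{total:.1f} W saved across the chip")
```

A saving that would be noise on a single CPU core becomes a chip-level budget item once it is multiplied by the instance count.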

Some of the tradeoffs are a little different when compared with traditional chips. “The major challenge with all of these designs is that they need lots of compute,” says Ron Lowman, product marketing manager for artificial intelligence at Synopsys. “They can add more compute, more parallelism, but it is actually a compute and memory problem. It is about bandwidth and fighting the bottleneck to memory. It requires a co-design approach. In the past we talked about designing hardware and software together, and AI is now forcing that. It is also forcing memory and processors to be co-designed. These co-designs are critical.”

This leads to a variety of approaches being used to reduce power, and successful companies have to address them all.

Starting at the bottom
While the biggest gains may be at the top, a number of techniques can help at the bottom when particular needs of these chips are considered.

“An increasing number of companies are looking to implement dynamic or static voltage scaling in their inferencing chips,” says Hiroyuki Nagashima, general manager of Alchip USA. “To support voltage scaling, it involves library re-characterization, process monitor development, silicon calibration, power supply isolation, level shifting and multi-scenario timing sign-off…a lot of extra design efforts. A proper voltage scaling design enables customers to achieve their desired computing power without wasting energy.”
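
To see why voltage scaling is worth that effort, the sketch below uses the standard first-order approximation for CMOS switching power, P = alpha * C * V^2 * f. This formula is not quoted in the article and it ignores leakage and short-circuit current; the function names and operating points are assumptions made for illustration.

```python
def dynamic_power_watts(alpha: float, c_eff_farads: float,
                        vdd_volts: float, freq_hz: float) -> float:
    """First-order switching power: activity factor * capacitance * V^2 * f."""
    return alpha * c_eff_farads * vdd_volts ** 2 * freq_hz


def scaling_ratio(v_old: float, v_new: float, f_old: float, f_new: float) -> float:
    """Ratio of new to old dynamic power when only voltage and frequency change."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Hypothetical operating points: drop the supply from 1.0 V to 0.8 V
# and the clock from 1.0 GHz to 0.8 GHz.
ratio = scaling_ratio(1.0, 0.8, 1.0e9, 0.8e9)
print(f"dynamic power falls to about {ratio:.0%} of the original")
```

Because voltage enters as a square, a modest supply reduction buys more than the matching frequency reduction alone would.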

Some particular cells are also under the microscope. “We have seen people doing SRAM optimization,” says Synopsys’ Lowman. “They are doing this just for leakage. They are considering very low voltage capabilities that you would only think of for energy harvesting chipsets. We have also been doing customizations in SRAM to optimize the densities for the bitcells so they can fit more SRAM on chip. This enables them to put more compute on chip with less latency. These are the types of techniques that are being used.”

Memory is key when it comes to AI and ML chips. “The way you architect the memory, the way you customize and optimize that architecture, the way you place it, where you place it and how you place it will all make a difference,” says Patrick Soheili, vice president of business and corporate development for eSilicon. “If you access DRAM through DDR, you will spend 100X more energy transferring data from the chip into the DDR. If you bring it into the package in the form of an HBM device in stacks of DRAM on the same substrate as the chip, you will save a lot of energy. But that is still 10X to 100X more energy than if you could put that memory inside the chip. Then, when it is inside the chip, by optimizing both the placement and the architecture and thinking more actively about how to place that memory you will save more energy because you are thinking about the datapath. You are thinking about the flow of information, how often it is accessed, and how far do you have to go to access it.”

Even when the memory is on-chip, care has to be taken with placement. “For AI inferencing, the power consumption of on-chip data movement is up to 40% of the power for the whole chip,” says Alchip’s Nagashima. “At layout level, hundreds or thousands of memories are placed next to processing elements (PEs), and they need to be as close as possible to reduce data movement distance, and thus the interconnect power consumption. Signal routings should also be analyzed carefully and match the data movement flow direction. The power of memory itself should also be reduced by using techniques such as low voltage, dual rail technologies that apply different voltages to periphery and bit array.”

Ignoring some of these issues can lead to unexpected limitations in the chips. “From a supply perspective, due to interconnect variability and aggressive workload changes across the multitude of cores, we see issues associated with dynamic IR drop and transient events, which starves logic of its supply temporarily,” says Stephen Crosher, chief executive officer for Moortec. “Therefore, designers are having to consider how software can help to control the spread of workload in an even fashion. To be most power-efficient, there needs to be a tight feedback of the dynamic conditions from within the chip to the software. By using an accurate, responsive embedded monitoring fabric it is possible to gain a better control, bringing a more balanced thermal and voltage profile across the device—hence, optimizing performance, minimizing power consumption, and helping to minimize reliability issues that may otherwise occur.”

It all comes back to hardware/software co-design and processor/memory co-design.

Starting from the top
For traditional designs, power analysis has generally started at RTL. “There is a huge opportunity at the system level,” says ANSYS’ Gupta. “How do you architect a chip where you are playing with these massive blocks and you have to put them together in a way that optimizes power? Most power reduction techniques concentrate on the register level, clock and data gating. We want to abstract to a high level and provide visibility to the designers so they can make better decisions.”

That requires a change of design flow. “Today, even before they think about RTL, they have to start with high-level modeling,” says Synopsys’ Stahl. “They have to consider the mapping of the algorithms to these high-level architectural models. As they do this, they can also model power consumption and do some tradeoffs. So, they should know if they are going in the right direction or if there are problems with their architectural choices.”

There are several ways to look at architectural optimization. “When we teach customers to do architectural exploration, we often start with the memory bottleneck,” adds Stahl. “Everything around the memory subsystem or the interconnect is just traffic and can be modeled at the high level. So we concentrate on memory throughput and how much data they can pump in and out.”

And then there is the processor/memory co-design. “Every time you do any transaction on the multiply-accumulate blocks (MAC)—from the memories and back, or moving data around—you consume power,” says eSilicon’s Soheili. “So first you have to minimize that architecturally. Second, you can optimize and customize the circuitry to the extent that you can minimize power wherever you can. Many of these are architecturally dependent. Sometimes you do have to give up utilization of the MACs or relax latency. Building a neural network by hand, as opposed to synthesizing one, or picking up things off the shelf without any thought, has a big impact. If you can use half-precision or quarter-precision or fixed integers, that all makes a big difference. And they are all architectural decisions, where you have to figure out exactly if it hits the target budget or not.”
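
The precision tradeoff Soheili mentions can be illustrated with a minimal sketch of symmetric 8-bit quantization. This is one common scheme, not necessarily what any vendor quoted here uses; the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp after rounding so floating-point noise cannot leave the range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and their scale."""
    return [q * scale for q in quantized]


weights = [1.0, -1.0, 0.0, 0.25]  # hypothetical layer weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"worst-case error {worst_error:.4f}")
```

Each weight now occupies one byte instead of four, which shrinks both the storage and the data moved per MAC. Whether the resulting error is acceptable is the accuracy side of the architectural decision.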

Tools that help to make architectural choices can be highly valuable. “One of the big challenges of building custom hardware solutions is that you may need to try multiple combinations of different architectures with different precision to find the best tradeoff between power, performance, and area (PPA),” says Mike Fingeroff, high-level synthesis technologist at Mentor, a Siemens Business. “Doing this in RTL is impractical so designers are turning to high-level synthesis (HLS) to implement these custom solutions. HLS provides a number of high-level optimizations, such as automatic memory partitioning for complex memory architectures needed by the PE array, interface synthesis of AXI4 memory master interfaces for easily connecting to system memory, and synthesis of arbitrary precision data types for tuning the precision of the multiple hardware architectures. Furthermore, because the source language is C++, it can easily plug back into the deep-learning framework where the network was originally created, allowing verification of the architected and quantized network.”

Understanding the choices
As architectural choices are being considered, power needs to be constantly assessed at several levels. “I am seeing more requests for profile power,” says Gupta. “That is an engine optimized for looking at very long vectors and quickly coming up with a power waveform. The intent is to be able to see the shape of the power waveform. Then, you can create a moving average, and this could be related to the thermal time constant. Now you are looking at an averaged power waveform over time. This is reflective of the way temperature would change. These waveforms can be consumed by other tools that take package and chip models into consideration along with this power waveform. Then you can look at airflow and other cooling considerations.”
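
The averaging step Gupta describes can be sketched as a trailing moving average whose window is tied to a thermal time constant. The helper names, the sample trace, and the way the window is derived from the time constant are assumptions for illustration, not a description of any tool's behavior.

```python
def thermal_window_samples(thermal_time_constant_s: float, sample_period_s: float) -> int:
    """Number of samples spanning one thermal time constant (at least 1)."""
    return max(1, round(thermal_time_constant_s / sample_period_s))


def moving_average(power, window: int):
    """Trailing moving average; early samples average over what is available."""
    if window <= 0:
        raise ValueError("window must be positive")
    averaged = []
    running = 0.0
    for i, sample in enumerate(power):
        running += sample
        if i >= window:
            running -= power[i - window]  # drop the sample that left the window
        averaged.append(running / min(i + 1, window))
    return averaged


# Hypothetical per-interval power samples in watts, with a short burst.
trace = [1, 1, 5, 1, 1]
print(moving_average(trace, 2))
```

The averaged trace smooths out bursts that are shorter than the window, which is the point: the temperature responds to sustained power, not to single-cycle peaks.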

Fig 1. Accuracy and performance analysis tradeoffs. Source: ANSYS.

Analysis has to happen at various abstraction levels. “Everything at the system level is relative,” says Stahl. “There is no way to know any absolute numbers. But things tend to be fairly big at the relative level and provide a good indication if you are heading in the right direction. Power is always a tradeoff, so you are looking at throughput, resources, clock frequency, etc. But the only way to know exactly what is going on is to take the final architecture, at the RTL, and run the algorithms that have gone through the software stack.”

To get long enough runs, emulation is the only possible solution. “The first step is using a weighted power model that runs in the emulator,” explains Stahl. “We can calculate the weighted model by instrumenting the RTL with the weight functions for critical points in the design. That could be the memory, the clock network, etc. This weighted model is compiled along with the RTL and runs in the emulator. What comes out is the sum of the weights over time. This preserves maximum emulation speed.”
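
In software terms, the weighted model Stahl describes amounts to a per-cycle weighted sum of activity at the instrumented points. The sketch below is only an analogy in Python; in practice the weights are compiled into the emulated design, and the signal names and weight values here are hypothetical.

```python
def weighted_power_trace(activity, weights):
    """Per-cycle sum of weight * activity for each instrumented point.

    activity: list of dicts, one per cycle, mapping point name -> toggle count.
    weights:  dict mapping point name -> relative power weight.
    """
    return [
        sum(weights[name] * count for name, count in cycle.items())
        for cycle in activity
    ]


# Hypothetical weights for two critical points in the design.
weights = {"mem": 3.0, "clk": 1.0}
activity = [
    {"mem": 1, "clk": 1},  # cycle 0: memory access plus clock toggle
    {"mem": 0, "clk": 1},  # cycle 1: clock only
]
print(weighted_power_trace(activity, weights))
```

The output is a relative power number per cycle, which is enough to see the shape of the waveform without dumping full design state.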

Once you have the general power waveform, you can concentrate on the areas of concern. “You can rerun emulation on the critical windows and dump out the entire state information,” continues Stahl. “You can either take the expanded waveforms and calculate average power based on toggle activity or you can take the full activity file into a power analysis tool where you can get the exact average power. Alternatively, you can get cycle-by-cycle power. With that method, we get enough accuracy to decide, within this 10 million cycle emulation, the relative power of two different architectures that we might explore or if there are any events that look unusual.”
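
Picking the critical window to rerun can be as simple as finding the stretch of the coarse trace with the highest average power. The sketch below assumes a fixed window width and a hypothetical trace; real flows may select windows by other criteria, such as steep ramps.

```python
def peak_window(trace, width: int):
    """Return (start_index, average) of the fixed-width window with the highest mean."""
    if width <= 0 or width > len(trace):
        raise ValueError("width must be between 1 and len(trace)")
    current = sum(trace[:width])
    best_start, best_sum = 0, current
    for start in range(1, len(trace) - width + 1):
        # Slide the window by one sample: add the new one, drop the oldest.
        current += trace[start + width - 1] - trace[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start, best_sum / width


coarse_trace = [1, 2, 9, 8, 1]  # hypothetical relative power per interval
start, average = peak_window(coarse_trace, 2)
print(f"rerun with full dump from interval {start}, average {average}")
```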

Learning on the edge
To enable learning on the edge, power becomes a critical design criterion. “Along with PPA, we have to add another A, which is accuracy. And there is another P, which is programmability,” says Dave Pursley, product management director at Cadence. “For programmability you use a general-purpose processor, or GPU or TPU, or you can go the whole way to the other end which would be characterized by the inability to change anything except perhaps for the coefficients. This is only meant for a single application.”

Early success may be related to how much you are trying to achieve. “If you have new adversarial attacks, you need to be able to update those in real time, and the best way to do that is at the edge,” says Synopsys’ Lowman. “When you train on lesser computational capabilities, the training takes forever to the point where it may become impractical. So there is a mix of what you can do in the cloud versus at the edge, and there are tradeoffs in that development. If you do it at the edge, it would be a significant advantage. It is just that the compute time to bring something to market may be impractical.”

There are applications already pushing for this. “Nobody wants, when they say a keyword, for everyone’s device to wake up,” says Jim Steele, vice president of technology strategy for Knowles. “When you unbox an iPhone, you have to say, ‘Hey Siri’ multiple times so that it learns your voice characteristics and will only let your voice in. That is learning at the edge. It is very supervised learning in that you know they are saying the keyword because they have been instructed to do so, and you can do this without sacrificing performance.”

The key is setting realistic goals. “Learning on the edge, in the near term, will probably consist of incrementally updating the weights based on new training data gathered by the deployed network,” says Mentor’s Fingeroff. “The inference engine, which is already running in hardware, can be used directly for the feed-forward part of the network. The ‘learning’, which is not done in real-time, can be done using a very compact compute engine that can be optimized for area and power. Incremental learning will use a single data set, compared to training with tens of thousands or more data sets.”
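
A minimal sketch of such an incremental update follows, assuming a single linear layer, a squared-error loss, and one plain gradient step per new sample. These choices, the function names, and the numbers are illustrative assumptions; the article does not specify the learning rule.

```python
def forward(weights, bias, x):
    """Feed-forward pass of a single linear unit (the inference path)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def incremental_update(weights, bias, x, target, lr=0.01):
    """One gradient step on 0.5 * (prediction - target)^2 for a single sample."""
    error = forward(weights, bias, x) - target
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Hypothetical deployed weights and one newly gathered labeled sample.
weights, bias = [0.0, 0.0], 0.0
sample, label = [1.0, 2.0], 1.0
before = abs(forward(weights, bias, sample) - label)
weights, bias = incremental_update(weights, bias, sample, label, lr=0.1)
after = abs(forward(weights, bias, sample) - label)
print(f"error before {before:.2f}, after {after:.2f}")
```

The forward pass is reused unchanged from inference, and the update itself is a handful of multiply-accumulates per weight, which is why a compact engine can handle it outside the real-time path.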

Capability is pushed by need, and that means money. When enough money is thrown at a problem, solutions will be found. Money is already flowing into this area, so the perception is that learning on the edge is necessary. Early solutions have started to form, and this may dictate the progression of some aspects of learning algorithms.
