Hidden Costs In Faster, Low-Power AI Systems

Tradeoffs in AI/ML designs can affect everything from aging to reliability, but not always in predictable ways.


Chipmakers are building orders of magnitude better performance and energy efficiency into smart devices, but to achieve those goals they also are making tradeoffs that will have far-reaching, long-lasting, and in some cases unknown impacts.

Much of this activity is a direct result of pushing intelligence out to the edge, where it is needed to process, sort, and manage massive increases in data from sensors that are being integrated into just about all electronics. There are tens of billions of connected devices, many with multiple sensors collecting data in real time. Shipping all of that data to the cloud and back is impractical. There isn’t enough bandwidth. And even if there were, it requires too much energy and costs too much.

So chipmakers have trained their sights on improving efficiency and performance at the edge, leveraging multiple known and new approaches to speed up and reduce the power draw of AI/ML/DL systems. Among them:

  • Reduced accuracy. Computation in AI chips produces mathematical distributions rather than fixed numbers. The looser that distribution, the less accurate the results, and the less energy required to do that processing.
  • Better data. Reducing the amount of data that needs to be processed can significantly improve performance and energy efficiency. This requires being able to narrow what gets collected at the source, or the ability to quickly sift through data to determine what is valuable and what is not, sometimes using multiple stages of processing to refine that data.
  • Data-driven architectures. Unlike traditional processor designs, AI systems rely on both the faster flow of data between processing elements and memories, and shortened distances over which that data needs to travel.
  • Customized solutions. Algorithms can be made sparser and quantized, and accelerators can be tuned to specific algorithms, which can offer 100X or more improvements in performance with the same or less power.
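As a rough illustration of the quantization mentioned in the last bullet, the sketch below maps floating-point weights onto 8-bit integers and measures the precision lost. It assumes a simple symmetric linear scheme, which is only one of several possible approaches, and the function names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.003, 0.27, -1.27]  # hypothetical values, not from a real model
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The worst-case rounding error is about half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of an error bounded by roughly half the step size. Whether that error is acceptable depends on the application.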

Each of these approaches is effective, but all of them come with an associated cost. In some cases, that cost isn’t even fully understood because the tech industry is just beginning to embrace AI and where and how it can be used. That hasn’t deterred companies from adding AI everywhere, though. There is a frenzy of activity around building some form of AI into such edge devices as cars, consumer electronics, medical devices, and both on- and off-premise servers aimed at the still-unnamed gradations spanning from the “near” to “far” edge.

For AI systems, accuracy is roughly the equivalent of abstraction levels in tools. With high-level synthesis, for example, entire systems can be designed and modified at a very high level much more quickly than at the register transfer level. But this is only a rough outline of what the chip actually will look like.

The difference is that in AI systems, those higher levels of abstraction may be sufficient for some purposes, such as detecting movement in a security system. Typically it is coupled with systems that deliver higher accuracy, but at the cost of either lower speed or higher power.

This isn’t a fixed formula, though, and the results aren’t always what you might expect. Researchers from the University of California at San Diego found that by blending high-accuracy results with low-accuracy results in the search for new materials, they actually improved the accuracy of even the highest accuracy systems by 30% to 40%.

“Sometimes there are very cheap ways of getting large quantities of data that are not very accurate, and there are very expensive ways of getting very accurate data,” said Shyue Ping Ong, nano-engineering professor at UC San Diego. “You can combine both data sources. You can use the very big data set — which is not very accurate, but which proves the underlying architecture of the machine learning model — to work on the smaller data to make more accurate predictions. In our case, we don’t do that sequentially. We combine both data streams.”

Ong noted this is not limited to two data streams. It could include five or more different types of data. Theoretically, there is no limit, and the more streams the better.

The challenge is understanding and quantifying different accuracy levels, and understanding how systems using data at different levels of accuracy will mesh. So while it worked for materials engineering, it might not work in a medical device or a car, where two different accuracy levels could create wrong results.

“That’s an open problem,” said Rob Aitken, an Arm fellow. “If you have a system with a given accuracy, and another system with a different level of accuracy, their overall accuracy depends on how independent the two approaches are from one another, and what mechanism you use to combine the two. This is reasonably well understood in image recognition, but it’s harder with an automotive application where you have radar data and camera data. They’re independent of each other, but their accuracies are dependent on external factors. So if the radar says it’s a cat, and the camera says there’s nothing there at all, if it’s dark then you would assume the radar is right. But if it’s raining, then maybe the camera is right.”
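One way to picture the dependence on external conditions is a simple probabilistic fusion of the two sensors. The sketch below is a minimal example, assuming the sensors' errors are independent given the true state (a naive-Bayes assumption that may not hold in practice). All of the detection and false-alarm rates are invented for illustration and are not measured data.

```python
# Hypothetical per-condition sensor characteristics (not measured data):
# (probability of reporting an object when one is present,
#  probability of reporting an object when none is present)
SENSOR_MODEL = {
    "radar":  {"dark": (0.90, 0.05), "rain": (0.60, 0.30)},
    "camera": {"dark": (0.30, 0.02), "rain": (0.85, 0.05)},
}


def posterior_object(reports, condition, prior=0.5):
    """Naive-Bayes fusion: P(object present | sensor reports).

    Assumes sensor errors are conditionally independent given the true state.
    """
    p_present, p_absent = prior, 1.0 - prior
    for sensor, detected in reports.items():
        hit_rate, false_alarm = SENSOR_MODEL[sensor][condition]
        p_present *= hit_rate if detected else 1.0 - hit_rate
        p_absent *= false_alarm if detected else 1.0 - false_alarm
    return p_present / (p_present + p_absent)


# The radar reports an object, the camera reports nothing.
reports = {"radar": True, "camera": False}
in_dark = posterior_object(reports, "dark")
in_rain = posterior_object(reports, "rain")
print(in_dark, in_rain)
```

With these made-up rates, the same pair of conflicting reports favors the radar in the dark and the camera in the rain. The combined answer depends on the conditions as much as on the sensors themselves.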

This may be solved with redundant cameras and computation, but that requires more processing power and more weight, which in turn reduces the distance an electrified car can travel on a single charge and increases the overall cost of a vehicle. “So now you have to decide if that compensation is worth it, or is it better to apply the rule of thumb most of the time because that’s sufficient for your purpose,” Aitken said.

This is just one of many approaches being considered. “There are many knobs that are being researched, including lower-precision inference (binary, ternary) and sparsity to significantly reduce the computation and memory footprints,” said Nick Ni, director of product marketing for AI and software at Xilinx. “We have demonstrated over 10X speed-up using sparse models running FPGAs by implementing a sparse vector engine-based DSA. But some sparse models run very poorly — they often slow down — on CPUs, GPUs and AI chips, as many of them are designed to run traditional ‘dense’ AI models.”

Better data, but not necessarily more
Another approach is to improve the quality of the data being processed in the first place. This typically is done with a bigger data set. The general rule is that more data is better, but there is a growing recognition that isn’t necessarily true. By only collecting the right data, or by intelligently eliminating useless data, the efficiency and performance in one or more systems can be greatly improved. This is a very different way of looking at sparsity, and it requires using intelligence at the source or in multiple stages.

“By far the best way to improve the power efficiency is not to compute,” said Steven Woo, Rambus fellow and distinguished inventor. “There’s definitely a big gain if you can rule out information that you don’t need. Another way that people talk about doing this — and there’s a lot of work going on in this area — is sparsity. So once you have a trained neural network model, the way to think about this is neural networks are composed of nodes, neurons and connections between them. It’s really a multiply-accumulate kind of mathematical operation. You’re multiplying against what’s called a weight. It’s just a number that’s associated with the connection between two neurons. And if the weight of that connection is very, very small or close to zero, you may be able to round it to zero, in which case multiplying by a weight value that’s zero is the same as not doing any work. And so people introduce sparsity by first training a network, and then they look at the weight values and they just say, ‘Well, if they’re close enough to zero, I might be able to just say it’s zero.’ That’s another way of driving work out of the system.”
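The pruning Woo describes can be sketched in a few lines. This is a minimal example, assuming a fixed magnitude threshold and a plain multiply-accumulate loop. The weights, inputs, threshold, and function names are all hypothetical.

```python
def prune(weights, threshold):
    """Set weights whose magnitude is below the threshold to exactly zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparse_mac(weights, inputs):
    """Multiply-accumulate that skips zero weights; returns (sum, multiplies done)."""
    total, multiplies = 0.0, 0
    for w, x in zip(weights, inputs):
        if w == 0.0:
            continue  # multiplying by zero contributes nothing, so skip the work
        total += w * x
        multiplies += 1
    return total, multiplies


weights = [0.75, -0.002, 0.001, -0.5, 0.0004, 0.25]  # hypothetical trained weights
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

dense_result, dense_ops = sparse_mac(weights, inputs)
pruned_result, pruned_ops = sparse_mac(prune(weights, threshold=0.01), inputs)
print(dense_result, dense_ops, pruned_result, pruned_ops)
```

In this toy case half of the multiplies disappear while the output shifts only slightly. Real hardware sees that benefit only if it can actually skip the zeroed operations.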

The challenge here is understanding what gets left behind. With a complex system of systems involving a mission-critical or safety-critical application, making those kinds of assumptions can cause serious problems. In others, it may go unnoticed. But in cases where multiple systems interact, the impact is unknown. And as multiple systems are combined over time due to different life expectancies, the number of unknowns increases.

One of the biggest knobs to turn for performance and power in AI systems is designing the hardware to take full advantage of the algorithm with as few wasted cycles as possible. On the software side, this involves being able to combine whatever is possible into a single multiply-accumulate function. The problem is that the tooling and the metrics for each are very different, and understanding cause and effect across disciplines is a challenge that has never been fully resolved.

“Software is a big part in all of this, and what you can do in software has a big impact on what you can do in hardware,” said Arun Venkatachar, vice president of AI and central engineering at Synopsys. “Many times you don’t need so many nodes. Leveraging the software area can help get the performance and the partitioning needed to make this happen. This needs to be part of the architecture and the tradeoffs you make on power.”

IBM, like most large systems companies, has been designing customized systems from the ground up. “The goal has been to convert algorithms into architecture and circuits,” said Mukesh Khare, vice president of hybrid cloud at IBM Research. “We’ve been focused more on the deep learning workload. For us, deep learning is the most critical part of the AI workload, and that requires an understanding of math and how to develop an architecture based on that. We’ve been working on developing building blocks in hardware and software so that developers writing code do not have to worry about the hardware. We’ve developed a common set of building blocks and tools.”

Khare said the goal is to improve compute efficiency by 1,000 times over 10 years by focusing on chip architectures, heterogeneous integration, and package technology where memory is moved closer and closer to the AI accelerators. The company also plans to deploy analog AI using 3nm technology, where weights and a small MAC are stored in the memory itself.
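As a back-of-the-envelope check on that target, the arithmetic below shows what a 1,000X gain over 10 years implies per year. It assumes the improvement compounds at a steady annual rate, which is an assumption made here for illustration and not something IBM states.

```python
import math

total_gain = 1000.0  # overall efficiency target cited above
years = 10

# Steady compounding: annual_factor ** years == total_gain
annual_factor = total_gain ** (1.0 / years)
doubling_time_years = math.log(2) / math.log(annual_factor)
print(annual_factor, doubling_time_years)
```

Under that assumption, efficiency would need to roughly double every year for a decade.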

Much of this has been discussed in the design world for the better part of a decade, and IBM is hardly alone. But rollouts of new technology don’t always proceed according to plan. There are dozens of startups working on specialized AI accelerator chips, some of which have been delayed due to almost continual changes in algorithms. This has put a spotlight on programmable accelerators, which intrinsically are slower than an optimized ASIC. But that loss in speed needs to be weighed against longer lifespans of some devices and the continual degradation of performance in accelerators that cannot adapt to changes in algorithms over that time period.

“Most of the modern advanced AI models are still designed for large-scale data center deployment, and it is challenging to fit into power/thermal-constrained edge devices while maintaining real-time performance,” said Xilinx’s Ni. “In addition, the model research is far from done, and there is constant innovation. Because of this, hardware adaptability to the latest models is critical to implement energy-efficient products based on AI. While CPU, GPU and AI chips are all essentially fixed hardware, where you need to rely on software optimization, FPGAs allow you to fully reconfigure the hardware with a new domain-specific architecture (DSA) that is designed for the latest models. In fact, we find it’s important to update the DSA periodically, ideally quarterly, to stay on top of the optimal performance and energy efficiency.”

Others agree. “Reconfigurable hardware platforms allow the needed flexibility and customization for upgrading and differentiation without requiring rebuilding,” said Raik Brinkmann, CEO of OneSpin Solutions. “Heterogeneous computing environments that include software programmable engines, accelerators, and programmable logic are essential for achieving platform reconfigurability as well as meeting low latency, low power, high-performance and capacity demands. These complex systems are expensive to develop, so anything that can be done to extend the life of the hardware while still maintaining customization will be essential.”

Customization and commonalities
Still, much of this depends on the particular application and the target market, particularly when it comes to devices connected to a battery.

“It depends on where you are in the edge,” said Frank Schirrmeister, senior group director of solutions marketing at Cadence. “Certain things you don’t want to change every minute, but virtual optimization is real. People can do a workload optimization at the scale they need, which may be hyperscale computing in the data center, and they will need to adapt these systems for their workloads.”

That customization likely will involve multiple chips, either within the same package or connected using some high-speed interconnect scheme. “So you actually deliver chipsets at a very complex level by not doing designs just at the chip level,” said Schirrmeister. “You’re now going to design by way of assembly, which uses 3D-IC techniques to assemble based on performance. That’s happening at a high complexity level.”

Fig. 1: Domain-specific AI systems. Source: Cadence

Many of these devices also include reconfigurability as part of the design because they are expensive to build and customize, and changes occur so rapidly that by the time systems containing these chips are brought to market, they already may be obsolete. In the case of some consumer products, time to market may be as long as two years. With cars or medical devices, that can be as long as five years. During the course of that development cycle, algorithms may have changed dozens of times.

The challenge is to balance customization, which can add orders of magnitude improvements in performance for the same or less power, against these rapid changes. The solution appears to be a combination of programmability and flexibility in the architecture.

“If you look at the enterprise edge for something like medical imaging, you need high throughput, high accuracy and low power,” said Geoff Tate, CEO of Flex Logix. “To start with, you need an architecture that is better than a GPU. You need finer granularity. Rather than having a big matrix multiplier, we use one-dimensional Tensor processors that are modular, so you can combine them in different ways to do different convolutions and matrix applications. That requires a programmable interconnect. And the last thing is we have our compute very close to memory to minimize latency and power.”

Memory access plays a key role here, as well. “All computation takes place in the SRAM, and we use the DRAM for weights. For YOLOv3, there are 62 million int8 weights. You need to get those weights off the chip so that the DRAM is never in the performance path. They get loaded into SRAM on chip. When they’re all loaded up, and when the previous compute finished, then we switch over to compute using the new weights that came in. We bring them on in the background while we’re doing other computations.”
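The background weight loading described above is a form of double buffering, and its effect on latency can be estimated with a simple timing model. The sketch below is illustrative only: the per-layer load and compute times are invented, and the only figure taken from the text is the count of 62 million int8 weights, at one byte each.

```python
def serial_time(layers):
    """Load each layer's weights, then compute, with no overlap."""
    return sum(load + compute for load, compute in layers)


def overlapped_time(layers):
    """Load the next layer's weights in the background during the current compute."""
    total = layers[0][0]  # the first load cannot be hidden behind any compute
    for i, (_, compute) in enumerate(layers):
        next_load = layers[i + 1][0] if i + 1 < len(layers) else 0.0
        # The switch to the next layer waits for both the compute and the load.
        total += max(compute, next_load)
    return total


# Hypothetical (weight_load_ms, compute_ms) pairs per layer, not YOLOv3 measurements.
layers = [(2.0, 5.0), (3.0, 4.0), (6.0, 5.0), (1.0, 3.0)]

# 62 million int8 weights at one byte each.
weight_megabytes = 62_000_000 * 1 / 1e6

print(serial_time(layers), overlapped_time(layers), weight_megabytes)
```

In this model the DRAM transfer only shows up in the total when a load takes longer than the compute it is hidden behind, as with the third layer here.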

Sometimes those weights are re-used, and each layer has a different set of weights. But the main idea behind this is that not everything is used all the time, and not everything has to be stored on the same die.

Arm has been looking at efficiency from a different side, using commonalities as a starting point. “There are certain classes of neural networks that have similar structures,” said Aitken. “So even though there are a million applications, you only have a handful of different structures. As time goes on, they may diverge more, and in the future we would hope there are a reasonable number of neural network structures. But as you get more of these over time, you can predict the evolution of them, as well.”

One of those areas is movement of data. The less it can be moved in the first place, and the shorter the distance that it has to be moved, the faster the results and the less power required.

“Data movement is really a big portion of the power budget these days,” said Rambus’ Woo. “Going to vertical stacking can alleviate that. It’s not without its own challenges, though. So there are issues with managing thermals, there’s issues with manufacturability, and issues with trying to merge pieces of silicon coming from different manufacturers together in a stack. Those are all things that have to be solved, but there is a benefit if that can happen.”

Fig. 2: How memory choices can affect power. Source: Rambus

That has other implications, as well. The more that circuits are utilized, the denser the heat, the harder it is to remove, and the faster circuits age. Minimizing the amount of processing can extend the lifetimes of entire systems.

“If we can make the slices of the pie smaller because we don’t need as much power to drive the data over longer distances, that does help the long-term reliability because you don’t see as large of a gradient in the temperature swings and you don’t see as much high power or high voltage related wear on the device,” Woo said. “But on the flip side, you have these devices in close proximity to each other, and memory doesn’t really like to be hot. On the other hand, a processor generally likes to burn power to get more performance.”

Rising design costs
Another piece of this puzzle involves the element of time. While there has been much attention paid to the reliability of traditional von Neumann designs over longer lifetimes, there has been far too little for AI systems. This is not just because this technology is being applied to new applications. AI systems are notoriously opaque, and they can evolve over time in ways that are not fully understood.

“The challenge is understanding what to measure, how to measure it, and what to do to make sure you have an optimized system,” said Anoop Saha, market development manager at Siemens EDA. “You can test how much time it takes to access data and how fast you process data, but this is very different from traditional semiconductor design. The architecture that is optimized for one model is not necessarily the same for another. You may have very different data types, and unit performance is not as important as system performance.”

This has an impact on how and what to partition in AI designs. “When you’re dealing with hardware-software co-design, you need to understand which part goes with which part of a system,” said Saha. “Some companies are using eFPGAs for this. Some are partitioning both hardware and software. You need to be able to understand this at a high level of abstraction and do a lot of design space exploration around the data and the pipeline and the microarchitecture. This is a system of systems, and if you look at the architecture of a car, for example, the SoC architecture depends on the overall architecture of the vehicle and overall system performance. But there’s another problem here, too. The silicon design typically takes two years, and by the time you use the architecture and optimize the performance you may have to go back and update the design again.”

This decision becomes more complex as designs are physically split up into multi-chip packages, Saha noted.

AI for AI
There also are practical limits to using AI technology. What works in one situation or market may not work as well in another, and even where it is proven to work there may be limits that are still being defined. This is apparent as the chip industry begins to leverage AI for various design and manufacturing processes based on a wide mix of data types and sources.

“We implement some AI technology into the current inspection solution, which we call the AI ADT (anti-diffract technology),” said Damon Tsai, director of inspection product management at Onto Innovation. “So we can improve the sensitivity with higher power, but we also can reduce the noise that goes along with that, as well. So AI ADC can help us to improve the classification rate for defects. Without AI image technology, we would use a very simple attribute to tell, ‘This is a scratch, this is the particle.’ For defect purity, typically we can only achieve around 60%, which means that another 40% still requires the human review or a SEM (scanning electron microscope) review. That takes a lot of time. With AI, we can achieve more than 85% defect purity and accuracy compared to the traditional image comparison technology, and in some cases we can do 95%. That means customers can reduce the number of operators and SEM review time and improve productivity. But if we cannot see a defect with brightfield or darkfield, AI cannot help.”
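The productivity gain in those figures can be worked out directly. The sketch below assumes, as the quote implies, that the share of defects needing human or SEM review is simply one minus the defect purity. The function name is hypothetical.

```python
def review_fraction(purity):
    """Share of detected defects that still needs human or SEM review."""
    return 1.0 - purity


baseline = review_fraction(0.60)   # traditional attribute-based classification
with_ai = review_fraction(0.85)    # "more than 85%" with AI
best_case = review_fraction(0.95)  # "in some cases" with AI

# Relative drop in review workload when moving from 60% to 85% purity.
reduction = 1.0 - with_ai / baseline
print(baseline, with_ai, best_case, reduction)
```

Under that assumption, going from 60% to 85% purity cuts the review workload by more than half.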

In other cases, the results may be surprisingly good, even if the process of obtaining those results isn’t well understood.

“One of the most interesting aspects of what we’re doing is we’re trying to understand advanced correlations between some of the ex-situ metrology data generated after the process has completed, and results obtained from machine learning and AI algorithms that use data from the sensors and in-process signals,” said David Fried, vice president of computational products at Lam Research. “Maybe there’s no reason that the sensor data would correlate or be a good surrogate for the ex-situ metrology data. But with machine learning and AI, we can find hidden signals. We might determine that some sensor in a given chamber, which really shouldn’t have any bearing on the process results, actually is measuring the final results. We’re learning how to interpret the complex signals coming from different sensors, so that we can perform real-time in-situ process control, even though on paper we don’t have a closed-form expression explaining why we’d do so.”
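A very simple stand-in for this kind of signal hunting is to rank each in-process sensor by how strongly it correlates with the post-process measurement. The sketch below uses plain Pearson correlation in place of the machine learning methods Fried describes, and every sensor name and data value is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-wafer data: one ex-situ measurement and two in-process sensor readings.
metrology = [10.1, 10.4, 9.8, 10.9, 10.0, 10.6]
sensors = {
    "chamber_pressure": [1.00, 1.02, 0.99, 1.01, 1.00, 0.98],
    "exhaust_temp": [50.2, 50.8, 49.6, 51.8, 50.0, 51.2],
}

# Rank sensors by the strength of their correlation with the final measurement.
ranked = sorted(
    sensors,
    key=lambda name: abs(pearson(sensors[name], metrology)),
    reverse=True,
)
print(ranked)
```

A sensor that ranks highly is a candidate surrogate for the ex-situ measurement, even without a physical explanation for the link. Correlation alone does not establish that the relationship will hold as the process drifts.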

The chip industry is still at the very early stages of understanding how AI works and how best to apply it for specific applications and workloads. The first step is to get it working and to move it out of the data center, and then to improve the efficiency of systems.

What isn’t clear, though, is how those systems work in conjunction with other systems, what the impact of various power-saving approaches will be, and how these systems ultimately will interface with other systems when there is no human in the middle. In some cases, accuracy has been unexpectedly improved, while in others the results are muddy, at best. But there is no turning back, and the industry will have to begin sharing data and results to understand the benefits and limitations of installing AI everywhere. This is a whole different approach to computing, and it will require an equally different way for companies to interact in order to push this technology forward without some major stumbles.

Mads Hommelgaard, Managing Director, MemXcell says:

I would encourage those interested in compute-in-memory solutions for AI to review Dr. Stefan Cosemans’ presentation at 2020 ESSCIRC or his 2019 IEDM paper on this topic. Dr. Cosemans gives very good insight into the architectures and the fundamental problems in implementation. His analysis includes the impact of component accuracy (DACs, ADCs and weights) and non-idealities, such as how wire resistance impacts memory cell selection and array accuracy.
