Defining And Improving AI Performance

What does performance mean when discussing artificial intelligence? Many minds are working the problem, but not yet in a unified manner.


Many companies are developing AI chips, both for training and for inference. Although getting the required functionality is important, many solutions will be judged by their performance characteristics. Performance can be measured in different ways, such as the number of inferences per second or per watt. These figures depend on many factors, not just the hardware architecture, and optimizing the software toolchain is far more complex than it was for general-purpose processors.

Some early devices have not met their objectives. “In talking to customers evaluating AI inference solutions, we consistently hear that actual results often fail to come close to the original claims,” says Geoff Tate, CEO of Flex Logix.

Some of this is due to the immaturity of the technology. “Think of the time it takes to go from an original design concept and take that through to silicon,” says Andrew Grant, senior product director at Imagination Technologies. “That can be two to three years. They are all gambling and trying to work out the trajectory of the market. Will they still be executing the right workloads in two years’ time or five years’ time? If not, they will be left by the roadside as something else takes over and their market is gone.”

The AI market is complex with many goals. “Everyone is trying to figure out what is good enough,” says Ron Lowman, product marketing manager for artificial intelligence at Synopsys. “It comes down to the use case. Some use cases require much better accuracy — such as in automotive, compared to some simple object identification. Some care about performance per millijoule or per watt, or inferences per second. It all comes down to the use case and what they are trying to accomplish.”
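The metrics Lowman lists are related by simple arithmetic. The following is a minimal sketch, assuming a hypothetical measurement (the function names and numbers are illustrative and do not come from any vendor):

```python
def inferences_per_second(num_inferences: int, elapsed_s: float) -> float:
    """Throughput: completed inferences divided by wall-clock time."""
    return num_inferences / elapsed_s


def inferences_per_joule(throughput_ips: float, avg_power_w: float) -> float:
    """Efficiency. A watt is one joule per second, so
    (inferences/s) / (J/s) = inferences/J, which is the same quantity
    as inferences per second per watt."""
    return throughput_ips / avg_power_w


def millijoules_per_inference(throughput_ips: float, avg_power_w: float) -> float:
    """Energy cost of a single inference, in millijoules."""
    return 1000.0 * avg_power_w / throughput_ips


# Hypothetical measurement: 1,200 inferences in 2 s at an average of 5 W.
ips = inferences_per_second(1200, 2.0)
print(ips, inferences_per_joule(ips, 5.0), millijoules_per_inference(ips, 5.0))
```

Which of these numbers matters most depends on the use case: a battery-powered device may care about millijoules per inference, while a data center may care about throughput per watt.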

Current implementations are a long way from being ideal. “Estimations say that a network of processor cores with the overall compute power of the human brain will fail in meeting the power consumption by at least four orders of magnitude,” says Roland Jancke, head of department for design methodology at Fraunhofer IIS Engineering of Adaptive Systems Division. “The efficiency seems to be related to very small yet flexible compute nodes with numerous configurable interconnects.”

And power is a big concern. “The die sizes are often particularly large with thousands of cores all striving for maximum performance and data throughput on advanced nodes,” says Richard McPartland, technical marketing manager for Moortec. “Power requirements are typically so large — often in the hundreds of watts range — that thermal management and power distribution are key issues. With so much power pouring into these large die, temperature monitoring of critical circuits is a must. Typically, we see tens of temperature sensors being implemented across large AI die to monitor multiple clusters of cores. These enable thermal load balancing, where workloads are distributed not just based on which cores are available, but also on real-time temperatures. Accurate temperature sensors enable finer-grain throttling of compute elements and help keep compute throttling to the minimum.”
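Thermal load balancing of the kind McPartland describes can be sketched as a scheduling rule. This is a simplified illustration, not Moortec's implementation: the `Core` type, the `pick_core` helper, and the 95°C limit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

THROTTLE_LIMIT_C = 95.0  # hypothetical limit; real limits are device-specific


@dataclass
class Core:
    core_id: int
    busy: bool
    temp_c: float  # latest reading from the nearest temperature sensor


def pick_core(cores: List[Core], limit_c: float = THROTTLE_LIMIT_C) -> Optional[Core]:
    """Choose the coolest idle core below the limit, or None if none qualifies.

    A purely availability-based scheduler would take any idle core;
    this one also uses the real-time temperature.
    """
    candidates = [c for c in cores if not c.busy and c.temp_c < limit_c]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.temp_c)


cores = [
    Core(0, busy=False, temp_c=88.0),
    Core(1, busy=True, temp_c=60.0),   # coolest, but not available
    Core(2, busy=False, temp_c=71.5),  # coolest idle core
    Core(3, busy=False, temp_c=97.0),  # idle, but over the limit
]
chosen = pick_core(cores)
print(chosen.core_id if chosen else "no core available")
```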

Accuracy is a new optimization criterion. “The main metric is performance, meaning throughput in terms of how many inferences or trainings I can process in a certain amount of time,” says Tim Kogel, principal engineer at Synopsys. “Second is power consumption and that applies both for embedded devices and the data center, where power has become the primary cost metric. It is also accuracy — the quality of the results — how good is the inference. I can optimize an implementation by reducing quantization, but that has a cost in terms of loss of accuracy.”

Some loss of accuracy is to be expected. “They would like the accuracy of inference to be very close to what they are getting with the trained network and algorithm,” says Lazaar Louis, senior director of product management, marketing and business development for Tensilica IP at Cadence. “It is trained in floating-point and they have accuracy expectations. Research and examples have shown that you can achieve very good inference with integer processing, and you do not need floating point. Some companies say that they are willing to tolerate a couple of percent error because it is okay for their application and they want the best performance. It is very important that we have a software stack that can meet those requirements.”
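The move from floating point to integer processing can be illustrated with a small quantization example. This sketch assumes symmetric, per-tensor 8-bit quantization, which is one common scheme and not necessarily the one any vendor quoted here uses. The weights are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.05, -1.27, 0.33]  # hypothetical trained weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_err)
```

The per-weight error is bounded, but how it accumulates through a deep network is what determines the final loss of accuracy, and that can only be measured on the real model and data.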

And this is where things start to get complicated. “This is the beauty and the challenge,” adds Synopsys’ Kogel. “All parts matter because they are coupled. It does not make sense to come up with the beautiful hardware architecture and not have a good ML compiler that takes advantage of all the features in your hardware. When talking about algorithms, which includes the quality of the data and the architecture of the network itself, it is the domain of the data scientists. When looking at it from the perspective of the semiconductor, or the system house that is building an inference chip, it is about mapping these algorithms and running them. It is the toolchain and the hardware together that are responsible for the quality of results and how well the metrics are met. Plus, the combination of metrics is different for different applications. ML can be applied in many areas and all of them have different requirements for performance, power and accuracy. It is difficult to come up with one that fits all.”

Three levels of optimization
There are three discrete levels at which decisions and optimization can be made — algorithm development, mapping and hardware architecture.

“There has been an uptick in the development of different types of network,” says Synopsys’ Lowman. “Convolutional neural networks (CNN) are becoming more mature and I think architectures have looked at how to optimize for those. More recently there are new networks such as recurrent neural networks (RNN) that have slightly different math where you do some time-based understanding. You are feeding previous values back. Then there are spiking NN, which are quite different.”

The separation between a network and the final performance of inference is large. “There are some parameters that the data scientists know,” says Cadence’s Louis. “These are dependent upon the architecture they have chosen. They know how much compute is required to do inference on one frame based on that architecture. They can structure the network accordingly, but once you have trained it, only then will you know the actual performance for that network — so your frames per second on that network will only be known after you complete the work — but they do have some idea.”

How close those results are is difficult to tell. “It is a property of the algorithm as to how many operations are needed and how much data is required,” says Kogel. “But there is so much that happens in the mapping step, such as selecting the right quantization and data types, optimizations that are made in the compiler such as layer fusion, pipelining, unrolling, tiling of all of the processing loops — which in the end determines what the final performance, accuracy and power will be. To some degree it is even dangerous to make these early assumptions.”
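Loop tiling, one of the compiler transformations Kogel mentions, changes the order in which work is done without changing the result. The sketch below is a plain-Python illustration with a hypothetical tile size of 2. A real ML compiler would pick tile sizes to match on-chip memory and would generate very different code.

```python
def matmul(a, b):
    """Reference matrix multiply: C = A x B."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same computation, processed in tile x tile blocks.

    Each block touches only a small working set, which is what lets
    hardware keep the data in local memory instead of refetching it.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
b = [[1, 0, 2], [0, 1, 3], [4, 5, 6], [7, 8, 9]]
print(matmul(a, b) == matmul_tiled(a, b))
```

The operation count is identical in both versions. What differs is the memory access pattern, which is why the algorithm alone does not determine final performance.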

Few people understand the full implications of those tradeoffs. “This is another reason why early devices are not meeting expectations,” says Flex Logix’s Tate. “In many cases the hardware architecture was done and then the software team was hired. The right way to do it is to co-develop the hardware and the software in conjunction to achieve optimum results.”

Kogel notes that some of these tradeoffs can have unexpected consequences. “There is a trend to make the NNs smaller in terms of the amount of data and processing, but that can have inverse effects later in implementation,” he says. “For example, when you reduce the data, you also reduce the computational intensity, so the implementation becomes more dependent on memory bandwidth, which becomes the limiting factor. This is typically a more difficult problem to solve than just providing more horsepower for processing. Or you reduce the data using compression on the weights, but you make the amount of data that you effectively use less predictable. It is no longer regular, and that might have an adverse effect on some aspect of your design. It is a many-coupled problem, and many things have to be considered together.”
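The link between computational intensity and memory bandwidth can be approximated with the roofline model. The article does not name this model. It is used here as a common first-order approximation, and the peak compute and bandwidth figures are hypothetical.

```python
def attainable_ops_per_s(peak_ops_per_s: float,
                         mem_bw_bytes_per_s: float,
                         ops_per_byte: float) -> float:
    """Roofline estimate: performance is capped either by the compute
    peak or by how fast memory can feed the compute units."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)


def is_memory_bound(peak_ops_per_s: float,
                    mem_bw_bytes_per_s: float,
                    ops_per_byte: float) -> bool:
    """True when memory bandwidth, not compute, is the limiting factor."""
    return mem_bw_bytes_per_s * ops_per_byte < peak_ops_per_s


PEAK = 10e12       # hypothetical accelerator: 10 tera-ops per second
BANDWIDTH = 50e9   # hypothetical memory system: 50 GB/s

for intensity in (400, 50):  # operations performed per byte fetched
    print(intensity,
          attainable_ops_per_s(PEAK, BANDWIDTH, intensity),
          is_memory_bound(PEAK, BANDWIDTH, intensity))
```

With these assumed numbers, a network that performs 400 operations per byte fetched can use the full compute peak, while one that performs only 50 is limited to a quarter of it, regardless of how many MACs are available.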

Hardware architectures
Getting the hardware architecture right is not easy. “Many AI inference architectures have characteristics that make modeling performance difficult,” says Tate. “To draw an analogy, look at a multicore processor. Consider that an eight-core processor doesn’t run eight times faster than a single-core processor. How much faster it runs depends on things like cache hit rates and resolution of bus access contention and shared memory access contention, which are very hard to model.”
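A much simpler effect than the contention Tate describes already shows why eight cores do not give an eightfold speedup. The sketch below uses Amdahl's law, which is not mentioned in the article and does not model cache or bus contention. It only accounts for the fraction of work that cannot be parallelized, so real results would typically be worse. The 90% figure is an assumption.

```python
def amdahl_speedup(parallel_fraction: float, num_cores: int) -> float:
    """Ideal speedup when only part of the workload can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_cores)


# Hypothetical workload in which 90% of the time is parallelizable.
for cores in (1, 2, 4, 8):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Under this assumption, eight cores yield a speedup of roughly 4.7, before any of the contention effects that make real modeling difficult are considered.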

When talking about AI architectures, most people think about multiply-accumulate (MAC) arrays. “There is the MAC and then there is the architecture around it, in particular the architecture of the memories,” says Lowman. “We have seen a big uptick in specialized memories. Some people need highly dense memories, some need very low leakage memory. We are being asked to do custom memories just for those types of implementations and what they are trying to accomplish.”

Multi-port memories are very popular for AI. “This means you can parallelize reads and writes when you are doing the math,” continues Lowman. “That can cut the power in half. Sometimes they will want to optimize the bitcells for density, perhaps because they need more coefficients. Others want to optimize for leakage, and that is a different tradeoff. There is always a tradeoff between density and leakage, or size and leakage and performance.”

Others see similar tradeoffs. “I see demand for a pseudo two-port memory,” says Farzad Zarrinfar, managing director of IP at Mentor, a Siemens Business. “Traditionally, there was single-port, from which you can read or write, or dual-port, where each port could be read or write. Two-port is one port read, one port write. Now I see demand for pseudo two-port, which can utilize the six-transistor SRAM cell rather than the eight-transistor cell, which is what dual- and two-port use. You operate on both edges of the clock. On one edge you can read and on the other you can write. When you are talking about thousands of tiles, then size matters.”
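The behavior Zarrinfar describes can be modeled in a few lines. This is a behavioral sketch only, not a circuit description. The class name is hypothetical, and the choice of reading on the first clock edge and writing on the second is an assumption, since the quote does not say which edge does which.

```python
class PseudoTwoPortSRAM:
    """One physical port, time-shared within a single clock cycle."""

    def __init__(self, depth: int):
        self.cells = [0] * depth

    def cycle(self, read_addr=None, write_addr=None, write_data=None):
        """Perform up to one read and one write in the same clock cycle."""
        # First edge (assumed): the single port services the read.
        read_data = self.cells[read_addr] if read_addr is not None else None
        # Second edge (assumed): the same port services the write.
        if write_addr is not None:
            self.cells[write_addr] = write_data
        return read_data


mem = PseudoTwoPortSRAM(depth=4)
mem.cycle(write_addr=1, write_data=7)
# Read and write to the same address in one cycle: the read sees the old value.
print(mem.cycle(read_addr=1, write_addr=1, write_data=9))
print(mem.cycle(read_addr=1))
```

From the point of view of the surrounding logic, this looks like a two-port memory with one read and one write per cycle, while the array itself only needs the smaller single-port cell.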

Manufacturing these chips also can lead to surprises. “The combination of large dies on advanced process nodes immediately calls the challenge of process variation to mind,” says Moortec’s McPartland. “Embedding process detectors across the die, typically with one per cluster of AI cores, enables die-to-die and within-die process variation to be easily and independently monitored. These detectors can be used to enable voltage scaling schemes to be efficiently implemented, and the supply voltage optimized on a per-die basis or the device speed binned.”

As stated, these architectures are chasing a moving target. “There is a lot of research going into this area, and you can expect a certain evolution,” says Kogel. “Estimating what will come is not easy and adds to the challenge that you need to provide a certain level of flexibility. Then the question is, ‘How much?’ An FPGA is very flexible, but a DSP may be more optimized. Even given an algorithm or a target application, what is the best fit in terms of architecture and the right level of flexibility plus the power/performance metrics?”

“There is value in flexibility,” says Louis. “It is important to strike the right balance between creating an engine that does well in all of the workloads that all of our customers are doing today, but also an engine that has some flexibility to allow new innovations to be taken advantage of.”

The compiler
Sitting in the middle, between the algorithm developers and the hardware architects, is the compiler. This is considerably more complex than compilers for traditional ISAs. “First is the conversion from floating point to fixed point,” says Louis. “Quantization is fairly well understood. We then can look at optimization. For example, there may be neurons or connections that are duplicated or are not necessary and we can start removing those. We can start merging layers where possible to reduce the amount of compute that you need to do for the same application or network and achieve similar accuracy and performance. Also, there are some applications where the customer does not want the network to be modified because it may be a safety-critical application, such as automotive. They do not want to modify the network because they do not know what corner cases may no longer perform well.”
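Layer merging can be shown with a small example. The sketch below folds a per-output scale-and-shift step (the form batch normalization takes at inference time) into the preceding linear layer, so one layer does the work of two. It is an illustration of the general idea, not the Cadence toolchain, and all values are made up.

```python
def linear(w, b, x):
    """y = W x + b for a small dense layer."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def scale_shift(gamma, beta, y):
    """Per-output scale and shift applied after the linear layer."""
    return [g * v + s for g, v, s in zip(gamma, y, beta)]


def fold_scale_shift(w, b, gamma, beta):
    """Merge the scale-and-shift into the weights and biases:
    gamma * (W x + b) + beta == (gamma * W) x + (gamma * b + beta)."""
    w_folded = [[g * wi for wi in row] for row, g in zip(w, gamma)]
    b_folded = [g * bi + s for g, bi, s in zip(gamma, b, beta)]
    return w_folded, b_folded


w, b = [[1, 2], [3, 4]], [1, -1]
gamma, beta = [2, 3], [5, -2]
x = [7, -3]

two_layers = scale_shift(gamma, beta, linear(w, b, x))
w_f, b_f = fold_scale_shift(w, b, gamma, beta)
one_layer = linear(w_f, b_f, x)
print(two_layers, one_layer)
```

This particular merge is mathematically exact. Other optimizations mentioned here, such as removing neurons or connections, change the network's outputs, which is why safety-critical users may not allow them.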

Software can ratchet up development costs as teams struggle to get this right. “Companies often spend twice as much to develop the software as they did for the hardware,” adds Lowman. “You can model things in the cloud and spit it out via ONNX or Caffe2 or TensorFlow, but then you need a bit mapping tool to quantize it, compress it, make sure it fits into a very tight resource. That is expensive. During that process you may lose some accuracy and you are not really sure why. It can take a few iterations to get it right.”

The industry is in the process of developing benchmarks for AI. This always leads to a variety of opinions. “There is nothing wrong with benchmarks, but they should be relevant to the workload planned,” says Tate. “Benchmarks like ResNet-50 using 224 x 224 images are not relevant. Customers have sensors with megapixel images and accuracy comes from more resolution. A very small image will not stress the memory subsystem and can lead to incorrect conclusions about the relative merits of various chips.”
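Tate's point about image size can be made concrete with a quick calculation. The sketch assumes 3-channel, 8-bit input images and uses 1920 x 1080 as a hypothetical stand-in for a "megapixel" sensor. Neither assumption comes from the article.

```python
def input_bytes(width: int, height: int, channels: int = 3,
                bytes_per_value: int = 1) -> int:
    """Size of one uncompressed input image in bytes."""
    return width * height * channels * bytes_per_value


benchmark = input_bytes(224, 224)    # 150528 bytes
sensor = input_bytes(1920, 1080)     # 6220800 bytes
print(benchmark, sensor, sensor / benchmark)
```

Under these assumptions the sensor image is about 41 times larger than the benchmark image. In a convolutional network the intermediate activations grow roughly in proportion to the input area, so data that fits in on-chip memory at 224 x 224 may spill to external memory at full resolution.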

Benchmarks create a common currency. “What you don’t want to see is what we have seen in graphics, where a particular benchmark becomes dominant because everyone has heard of it and you find companies designing to get the best score on that benchmark,” says Imagination’s Grant. “That distorts what they do. What you need is a basket of relevant benchmarks that get updated over time that the industry can understand and work with, but which do not distort the picture. We should not be trying to drive AI forward by looking in the rear-view mirror, optimizing stuff that was important years ago.”

There is no shortage of support for that view. “Benchmarks are useful, but not enough to address the wide variety of possible architectural implementations,” says Benjamin Prautsch, group manager for mixed-signal automation at the Fraunhofer Institute. “Benchmarks cannot hope to cover this diversity. There are already studies available that compare different AI structures among problem classes. Standard topologies derive from that, but this is more an aid for architectural decisions and does not yet solve the co-design problem between algorithms and architecture.”

Several attempts have been made to create benchmark suites. “New efforts, like MLPerf, bring in all of the players in the industry and define what the right set of benchmarks are for various applications,” says Louis. “This includes training and inference. It is trying to maintain that common playing field. People are not trying to take advantage of the benchmark. Instead, they are defining it to be broad and applicable to real-world examples. MLPerf is a collection of five benchmarks, five networks that are brought together to represent some of the real world. It minimizes the ability for someone to do something just for the benchmark and not the real-world application.”

Today, the industry may be moving too quickly for many of these to make sense at the hardware level. “There is not enough time for people to optimize for benchmarks,” says Lowman. “They would get left behind. They are using it as a general tool to narrow some of their decisions. For instance, when you look at the mobile space, those accelerators are perhaps 5 to 20 TOPS. They are saying we need to be above 20 TOPS in the future, but then you ask, ‘Can a phone handle that? What will be the power budget?’ They may have some inferences per second or general indicators in mind, but until they run the application, they don’t really know.”
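The power-budget question Lowman raises comes down to simple arithmetic once an efficiency figure is assumed. In the sketch below, the MAC count, clock rate, and the 4 TOPS/W efficiency are all hypothetical, and peak TOPS follows the common convention of counting each multiply-accumulate as two operations.

```python
def peak_tops(num_macs: int, clock_hz: float) -> float:
    """Peak tera-operations per second for a MAC array.
    Each MAC counts as two operations (one multiply, one add)."""
    return 2 * num_macs * clock_hz / 1e12


def power_watts(tops: float, tops_per_watt: float) -> float:
    """Power needed to sustain a given throughput at a given efficiency."""
    return tops / tops_per_watt


print(peak_tops(4096, 1.0e9))   # hypothetical 4,096-MAC array at 1 GHz
print(power_watts(20, 4))       # 20 TOPS at an assumed 4 TOPS/W
```

At the assumed efficiency, 20 TOPS would draw 5 W. Peak TOPS is also only an upper bound, since sustained throughput depends on how well the application keeps the MAC array fed.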



Gene Mosher says:

Really excellent article, Mr. Brian Bailey. Thank You !!
