Fast, Low-Power Inferencing

Significant contributors to performance and efficiency for custom developed hardware inferencing accelerators.


Power and performance are often thought of as opposing goals, opposite sides of the same coin if you will. A system can be run really fast, but it will burn a lot of power. Ease up on the accelerator and power consumption goes down, but so does performance. Optimizing for both power and performance is challenging.

Inferencing algorithms for Convolutional Neural Networks (CNNs) are compute intensive. For large networks they can involve billions of multiply-accumulate operations to produce a single inference. This means they take a long time to run and burn a lot of power. While CNNs are often developed on general-purpose computers in Python machine learning frameworks, such as TensorFlow, many will be deployed on embedded systems. These are showing up all around us: the smart speaker that wakes up when you say “Alexa,” the car that gently nudges you back into your lane, and the phone that unlocks when it sees your face. But embedded systems are often limited in both compute capability and available power, so any inferencing algorithm will need to fit within those limitations.

While power is an important metric, energy per inference should also be considered when weighing the options for deploying an inferencing algorithm. Recall that power is energy per unit of time: watts = joules / second. To determine energy per inference, simply multiply power consumption by the time needed to perform the computation: joules = watts × seconds. Battery-powered embedded systems are ultimately limited by the energy that can be stored in the battery, which is a fixed number of joules.
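The relationship above can be made concrete with a few lines of Python. This is a minimal sketch: the two platforms, their power and latency figures, and the battery capacity are hypothetical numbers chosen only to illustrate the arithmetic, not measurements.

```python
def energy_per_inference(power_watts, seconds_per_inference):
    """Energy in joules = power in watts x time in seconds."""
    return power_watts * seconds_per_inference


def inferences_per_charge(battery_joules, power_watts, seconds_per_inference):
    """Whole inferences a battery holding `battery_joules` can supply."""
    joules = energy_per_inference(power_watts, seconds_per_inference)
    return int(battery_joules // joules)


# Hypothetical platforms: (power in watts, seconds per inference).
platforms = {
    "low-power, slow": (0.5, 2.0),
    "high-power, fast": (4.0, 0.125),
}
BATTERY_JOULES = 10_000  # hypothetical battery capacity

for name, (watts, seconds) in platforms.items():
    joules = energy_per_inference(watts, seconds)
    count = inferences_per_charge(BATTERY_JOULES, watts, seconds)
    print(f"{name}: {joules} J per inference, {count} inferences per charge")
```

In this made-up comparison the faster platform draws eight times the power but finishes sixteen times sooner, so it uses half the energy per inference. That is why power alone can be a misleading metric.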

When deploying an inferencing algorithm in an embedded system there are a number of choices available. One option is to run it on an embedded CPU. This will almost certainly be the slowest option. Another option is a GPU. GPUs bring a lot of parallel compute capability to the party, so a GPU will compute an inference much faster than a CPU, but this option will probably have the highest power consumption. Tensor Processing Units (TPUs) are another option. Like GPUs, they bring an array of processing elements to work on the problem. TPUs are specifically tailored to inferencing, so they will generally be more efficient than a GPU. There are some very efficient edge inferencing processors of this kind from Google and NVIDIA. But really pushing the boundaries of performance and efficiency requires a bespoke accelerator, deployed in an ASIC or FPGA.

Two of the most significant contributors to performance and efficiency for a custom developed hardware inferencing accelerator are quantization and data caching.

When data scientists come up with a CNN for a specific problem, they will likely use 32-bit floating-point numbers in all their calculations. This format has a modest memory footprint and can be operated on directly by the floating-point instructions in the CPU. But a hardware accelerator is not required to use 32-bit floating-point numbers; any representation that delivers the right answer can be used. Fixed-point representations result in smaller, and thus lower-power, hardware. A 32-bit fixed-point multiplier is about half the area of a 32-bit floating-point multiplier.

Reducing the number of bits in a number’s representation can further improve efficiency. The area of a multiplier is roughly proportional to the square of the bit width of its inputs. A multiplier with two 16-bit inputs is about one fourth the area of a multiplier with two 32-bit inputs. An 8-bit multiplier is about one sixteenth the area.
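The scaling rule can be checked with a short calculation. The sketch below assumes the area grows exactly with the square of the bit width and takes the "about half" fixed-point versus floating-point ratio as exactly one half. Both are simplifications of the rough figures given in the text, and the function names are illustrative.

```python
# "About half" from the text, treated as exactly 2x for this sketch.
FLOAT_TO_FIXED_AREA_RATIO = 2.0


def fixed_multiplier_area(bits, reference_bits=32):
    """Area relative to a `reference_bits` fixed-point multiplier,
    assuming area scales with the square of the input bit width."""
    return (bits / reference_bits) ** 2


def area_vs_float32(bits):
    """Area relative to a 32-bit floating-point multiplier."""
    return fixed_multiplier_area(bits) / FLOAT_TO_FIXED_AREA_RATIO


for bits in (32, 16, 8):
    print(
        f"{bits}-bit fixed point: "
        f"{fixed_multiplier_area(bits):.4f} of a 32-bit fixed-point multiplier, "
        f"{area_vs_float32(bits):.2%} of a 32-bit floating-point multiplier"
    )
```

Under these assumptions an 8-bit fixed-point multiplier comes out at 1/16 of the 32-bit fixed-point area, or 1/32 (about 3.1%) of the 32-bit floating-point area.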

It turns out that a 32-bit floating-point representation often provides much more precision than an inferencing algorithm needs. As an experiment, we tested the accuracy of ResNET, an object recognition algorithm, using fixed-point numbers from 32 bits down to one bit. What we found is that fixed-point numbers down to 8 bits produced the same accuracy as 32-bit floating-point numbers. With 6- and 7-bit fixed-point numbers, accuracy was still above 95%. Figure 1 shows the accuracy of ResNET inferencing for different fixed-point representations.
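To get a feel for what narrowing the representation does to an individual value, the following sketch rounds a number to a signed fixed-point format and reports the error. It is an illustration only, not the procedure used in the experiment: the function name, the sample weight, and the choice to keep two integer bits are all assumptions, and out-of-range values are simply clamped.

```python
def to_fixed_point(x, total_bits, frac_bits):
    """Round x to a signed fixed-point value that has `total_bits` bits,
    `frac_bits` of them fractional, and return the result as a float."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(x * scale)                  # nearest representable step
    code = max(lowest, min(highest, code))   # clamp to the available range
    return code / scale


weight = 0.7071  # hypothetical weight value
for total_bits in (16, 8, 6, 4):
    frac_bits = total_bits - 2  # assumed split: sign bit, one integer bit, rest fractional
    approx = to_fixed_point(weight, total_bits, frac_bits)
    print(f"{total_bits:2d} bits: {approx} (error {abs(approx - weight):.6f})")
```

The error per value grows as bits are removed. Whether that error matters for the final classification is what an accuracy sweep like the one in Figure 1 measures.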

Fig. 1: Accuracy vs. Bit Width for ResNET.

Another one or two bits can often be shaved off the numeric representation by using saturating arithmetic. Normally, if a math operation overflows the size of the output storage, the high-order bits are just dropped. The result is wrong, really wrong, and any further computations on it will continue to be wrong. This means that the hardware for an inferencing algorithm needs to be able to store the largest possible result from any computation, just in case. However, data scientists have learned that the exact magnitude of large numbers, both positive and negative, is not all that significant when computing an inference. This is especially true if the intermediate results are “normalized,” or scaled between -1.0 and 1.0, between layers. With saturating math, if an operation is going to overflow the output storage, the largest representable value of the appropriate sign is stored instead of dropping the high-order bits and keeping an incorrect result. The result is still wrong, but it is as close to the right answer as possible given the hardware.
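The difference between dropping the high-order bits and saturating is easy to see on a signed 8-bit example. This is a minimal sketch using plain Python integers to model two's-complement storage. The function names and operand values are chosen for illustration.

```python
def wrap(value, bits):
    """Model overflow by keeping only the low-order `bits` bits
    and reinterpreting them as a two's-complement signed number."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def saturate(value, bits):
    """Clamp to the largest or smallest value a signed `bits`-bit number can hold."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, value))


for a, b in ((100, 60), (-100, -60), (50, 20)):
    exact = a + b
    print(f"{a} + {b} = {exact}: wrap -> {wrap(exact, 8)}, saturate -> {saturate(exact, 8)}")
```

For 100 + 60 the exact answer is 160, which does not fit in a signed 8-bit value. Wrapping yields -96, which even has the wrong sign, while saturating yields 127. When the sum fits, as with 50 + 20, both approaches return the exact result.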

To produce the best results, the CNN should be trained using the same number representation as will be used for inferencing. Fortunately, this is not that difficult. The folks at Google have produced a package called “QKeras,” or quantized Keras. It has the same capabilities as the popular machine learning framework Keras, but instead of using 32-bit floating-point numbers it uses fixed-point numbers and operators. It employs the open-source algorithmic C data types. The user defines the size of the fixed-point numbers to be used in performing any of the training or inferencing calculations. Converting a Keras CNN to a QKeras network is almost trivial.

Quantizing the network, or reducing the number of bits in the numeric representation as much as possible, impacts performance and power in two ways. First, the math and logical operators used are smaller and more power efficient. As described, ResNET has the same accuracy whether using 32-bit floating-point numbers or 8-bit fixed-point numbers (even without saturating math), but an 8-bit fixed-point multiplier has only about 3.1% of the area of a 32-bit floating-point multiplier. Second, there is less data to be stored and moved. Inferencing accelerators tend to have a lot more memory than logic, so their power consumption is dominated by the leakage current of these large memories. Anything that can be done to reduce the storage requirements for either weights or intermediate results is a big win for power and performance.
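The storage side of this argument is simple arithmetic. The sketch below computes the memory needed for a set of weights at several bit widths. The weight count is a hypothetical round number, and tightly packed storage with no alignment padding is assumed.

```python
def storage_bytes(num_values, bits_per_value):
    """Bytes needed to hold `num_values` tightly packed values,
    rounded up to a whole number of bytes."""
    return (num_values * bits_per_value + 7) // 8


NUM_WEIGHTS = 1_000_000  # hypothetical network size

for bits in (32, 16, 8, 6):
    print(f"{bits:2d}-bit weights: {storage_bytes(NUM_WEIGHTS, bits):,} bytes")
```

Going from 32-bit to 8-bit values cuts the memory to a quarter, and every value that is moved carries a quarter of the bits. The same calculation applies to buffers holding intermediate results.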

Figure 2 shows the effect of quantization, saturating math, and retraining on the accuracy of a network for different numeric representations. One curve shows the accuracy of a quantized network without retraining or saturating math. Another shows the accuracy using saturating math but without retraining. The final curve shows the accuracy when the network is retrained using quantized data and employs saturating math operations.

Fig. 2: Fixed Point Inferencing.

The second significant contributor to performance and efficiency is data caching. The engineers at NVIDIA who were working on a machine learning accelerator observed that reading data from an off-chip memory cost about 200 times as much energy as performing an ALU operation on it. In an inferencing accelerator, data movement is therefore going to be the dominant factor for both performance and power. Any reduction in the size of that data is important, as every little bit helps. Fetching the data as few times as possible helps as well.

Consider the TensorFlow function conv2d(), the workhorse of traditional CNNs. When using a 3×3 convolution kernel, each feature map datum will be used 9 times for each output image. If there are 512 output images (not an unusually large number), each feature map datum will be used in 4,608 calculations. Fetching the datum each time it is used will clobber both performance and power, especially if it is being fetched from off-chip memory. A very small amount of memory can hold the feature data locally for computing each output image value. This immediately reduces the number of fetches by a factor of 9. By adding partial sum arrays equal to the size of all the output images, each input feature map value could be read in only once – with all the calculations depending on it performed before it is discarded. In practice, for most networks, storing multiple input or output images is impractical, so some amount of “tiling” is needed. A “tile” is a portion of each input or output image that is cached in the accelerator and operated on.
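The fetch counts in this paragraph can be worked through for a concrete layer. The sketch below is a back-of-the-envelope model, not a description of any particular accelerator: the layer dimensions, the 32-bit accumulator width, and the 8×8 tile are hypothetical, and border effects are ignored, so every input value is assumed to be covered by a full kernel window.

```python
def conv_fetch_counts(height, width, in_channels, out_channels, kernel=3):
    """Feature-map reads for one conv layer under three caching schemes."""
    inputs = height * width * in_channels
    uses_per_datum = kernel * kernel * out_channels
    return {
        # Every use of a datum is a separate fetch.
        "refetch_every_use": inputs * uses_per_datum,
        # The kernel window is held locally: one fetch per output image.
        "local_window": inputs * out_channels,
        # Partial sums kept for all output images: each datum is read once.
        "partial_sums": inputs,
    }


def partial_sum_bytes(height, width, out_channels, accumulator_bits=32):
    """Storage for one partial sum per output pixel per output image."""
    return height * width * out_channels * accumulator_bits // 8


# Hypothetical layer: 56x56 feature maps, 256 input and 512 output images.
for scheme, fetches in conv_fetch_counts(56, 56, 256, 512).items():
    print(f"{scheme}: {fetches:,} feature-map fetches")

print(f"partial sums, whole image: {partial_sum_bytes(56, 56, 512):,} bytes")
print(f"partial sums, 8x8 tile:    {partial_sum_bytes(8, 8, 512):,} bytes")
```

The first two schemes differ by the factor of 9 mentioned above, and the first and last by 9 × 512 = 4,608. The storage figures show why tiling comes in: under these assumptions, holding partial sums for the whole image takes several megabytes, while an 8×8 tile needs 128 KiB.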

In constructing an inferencing accelerator, the designer needs to pick the right tile size, number of tiles, and number of multipliers. What’s the right answer? It depends on the topology of the network being implemented, the silicon area available, the energy budget, and the performance requirements. Unfortunately, there is no formula for making a good selection. The designer needs to explore the alternatives and pick the right set of trade-offs.

It’s possible to use these techniques to create a fast, efficient inferencing accelerator. Table 1 shows our results from creating an accelerator for an MNIST handwritten digit recognition CNN. Adding the accelerator to the processor design does make the power and area of the overall design go up. But the accelerated design is almost 200 times faster than running the inference in software, and its energy per inference is less than 1% of the energy needed to perform the inference in software on the CPU.

Table 1: MNIST Power/Area Results.

The quantization analysis and optimization should be done in a machine learning framework; QKeras is ideal. This will allow the developer to understand the impact of the numeric representation on the size and accuracy of the network. Analyzing and optimizing data movement should be done in a C++ or SystemC model, which embodies all the memory, cache, and buffer elements in the design and the data movement between them. It needs to run fast enough to perform an inference in a few minutes or less. This will allow the designer to adequately explore the myriad alternative architectures. The engineers at NVIDIA developing an inferencing accelerator used a SystemC package called “Matchlib” to do this type of modeling; it is available on GitHub. It is simply impractical to do the analysis needed in a traditional HDL. If C++ or SystemC is used as the modeling language, then High-Level Synthesis can be used to derive an RTL implementation. This can then be run through RTL synthesis and place-and-route tools to create downstream implementations, which can be used for determining power, area, and energy. This combination of tools allows the developer to quickly understand the power, performance, and area of different implementations and find an optimal design.


Harri W. says:

Thanks for the clear article. I just found it when looking for information on QKeras. Some questions came to my mind. Does a saturating fixed point mean, for example, clipped ReLU? What method did you use when moving from QKeras model to SystemC model? To my understanding, QKeras utilizes so-called “quantization-aware training” method. I’m not sure if it matters in this context?
