The Efficiency Problem

Part 2: Solving Power Limitations for CNNs on DSP Processors


Part one of this report addressed the efficiency problem in neural networks. This segment addresses efficiency in training and quantization, and in optimizing both the network and the hardware.

Minimize the Bits (CNN Advanced Quantization)
Training a CNN involves assigning weight vectors that map inputs to particular results, and applying adaptive filters to those results to separate true positives from false positives and false negatives for each input.

Quantization, the process of choosing the numerical precision at which those weights are represented, can occur in one of two process flows:

Post-training quantization: This flow trains the network at full precision first, then assigns reduced-precision weights based on the patterns that emerge during training.

This method requires much more data processing than the alternative:

During-training quantization: This flow consists of presenting the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. The system can then commence its own “learning”, using test data.

This latter method is called stochastic gradient descent (SGD).
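The SGD loop just described can be sketched in a few lines of NumPy. The toy linear model, learning rate, and data below are illustrative placeholders, not an actual CNN training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y = X @ w, trained by minibatch SGD.
X = rng.normal(size=(256, 8))      # training inputs
true_w = rng.normal(size=8)        # "ground truth" weights to recover
y = X @ true_w                     # target outputs

w = np.zeros(8)                    # weights to learn
lr, batch = 0.1, 16                # learning rate and minibatch size

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = order[start:start + batch]
        err = X[b] @ w - y[b]            # compute outputs and the errors
        grad = X[b].T @ err / len(b)     # average gradient over the minibatch
        w -= lr * grad                   # adjust the weights accordingly

print(np.abs(w - true_w).max())          # error shrinks toward zero
```

Each pass draws a small set of examples, averages their gradient, and nudges the weights; repeating this until the objective stops decreasing is exactly the SGD procedure described above.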

After training, the performance of the system is measured on a different set of examples. This serves to test the generalization ability of the machine — its ability to produce sensible answers with new inputs that it has never seen during training.

Figure 1: CNN Advanced Quantization

And yet, the data size is still unwieldy. Several solutions may address this issue, including using a lower-precision fixed-point coefficient and batching.

Lower-precision fixed-point coefficient

The easiest way to reduce memory footprint is by using lower-precision fixed-point coefficients. The question arises, of course, whether the loss of accuracy is worth the amount saved.

Cadence engineers have shown that when changing from 16- to 8-bit data, coefficients, or both, there is very little loss of accuracy (1% or less). This results in a reduction of memory bandwidth by a factor of four compared to floating point, or by a factor of two compared to using a 16-bit representation.
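As a rough illustration of how such a fixed-point conversion works (a sketch of symmetric scaling quantization, not Cadence's actual tool flow), float coefficients can be mapped to 16- or 8-bit integers with a single scale factor, and the accuracy cost shows up as a small reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(1)
coeffs = rng.normal(scale=0.1, size=1000).astype(np.float32)  # stand-in trained coefficients

def quantize(x, bits):
    """Symmetric fixed-point quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax
    return np.round(x / scale).astype(np.int32), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

for bits in (16, 8):
    q, scale = quantize(coeffs, bits)
    rel_err = np.abs(dequantize(q, scale) - coeffs).max() / np.abs(coeffs).max()
    footprint = coeffs.size * bits // 8   # bytes, vs. coeffs.size * 4 for float32
    print(f"{bits}-bit: {footprint} bytes, max relative error {rel_err:.4f}")
```

The 8-bit footprint is a quarter of the float32 one, while the worst-case rounding error stays well under 1% of the largest coefficient.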

Figure 2 shows graphs using the German Traffic Sign Recognition Benchmark (GTSRB), which is a best-case test for vision recognition, because there are a limited number of signs to recognize, and they have been designed to be recognized easily in poor visibility conditions.

Fig. 2: Performance vs. Complexity, and Complexity vs. Accuracy.

Eight-bit operations consume less power than 32-bit (or 16-bit) operations, and less power is expended in the reduced memory reads that accompany the lower bandwidth. Even further, SIMD architectures with 8-bit data-type support deliver higher performance, because four times as many operations are possible on 8-bit data as on 32-bit data, and twice as many as on 16-bit data.
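The throughput claim is simple lane arithmetic; the sketch below assumes an illustrative 512-bit SIMD datapath:

```python
# Lane arithmetic for one SIMD register (the width is an assumed example, not a spec).
vector_bits = 512
for elem_bits in (32, 16, 8):
    lanes = vector_bits // elem_bits   # parallel operations per instruction
    print(f"{elem_bits}-bit data: {lanes} lanes")
# 8-bit data gives 4x the lanes of 32-bit data, and 2x the lanes of 16-bit data.
```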

Fully connected layers are data intensive and are therefore load bound: performance is limited by fetching coefficients rather than by computing. Batching, which means processing multiple frames together at the fully connected layer and re-utilizing the loaded coefficients, divides the coefficient bandwidth by the batch size and makes the layer compute bound. This reduces bandwidth and increases performance.
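A back-of-the-envelope sketch of the bandwidth effect (the layer dimensions below are illustrative, not the exact sizes of any real network):

```python
# Fully connected layer: each frame needs out = W @ x, and W must be fetched
# from external memory. Dimensions are illustrative placeholders.
n_in, n_out = 4096, 4096
w_bytes = n_in * n_out               # 8-bit coefficients: one byte each

def coeff_bytes_per_frame(batch_size):
    # The loaded coefficients are re-used across the whole batch,
    # so the per-frame coefficient traffic divides by the batch size.
    return w_bytes / batch_size

for batch_size in (1, 4, 16):
    mb = coeff_bytes_per_frame(batch_size) / 1e6
    print(f"batch={batch_size:2d}: {mb:.2f} MB of coefficients per frame")
```

Equivalently, batching turns a per-frame matrix-vector product into one matrix-matrix product over the whole batch, which is why the layer becomes compute bound.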

The results of batching are shown here, using AlexNet as an example.

Fig. 3: Batching and Bandwidth in AlexNet.

Determining batch size depends on various factors, including DDR latency, local memory size, tile (window) and stride size management, and so forth. The batched frames can then be processed by max-pooling, average-pooling, and other operations.
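For reference, a minimal sketch of the two pooling operations mentioned above, assuming non-overlapping windows (stride equal to the window size):

```python
import numpy as np

def pool2d(x, k, mode="max"):
    """Non-overlapping k x k pooling over a 2D feature map (stride = k)."""
    h, w = x.shape
    x = x[:h - h % k, :w - w % k]                # drop ragged edges
    tiles = x.reshape(h // k, k, w // k, k)      # group into k x k windows
    return tiles.max(axis=(1, 3)) if mode == "max" else tiles.mean(axis=(1, 3))

fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
print(pool2d(fmap, 2, "max"))    # -> [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, 2, "avg"))    # -> [[ 2.5  4.5] [10.5 12.5]]
```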

Optimize the Network Architecture
Cadence’s Tensilica Vision P6 DSP uses a VLIW SIMD architecture, which supports issuing up to five operations in parallel, each of which can operate on vectors of 64 8-bit or 32 16-bit elements simultaneously.

Further, this process can be automated. Cadence has been working with a generic superset network architecture called CactusNet, in which the network architecture can be optimized incrementally and analyzed for sensitivity.

Fig. 4: Cactus architecture.

Excellent results have been obtained using CactusNet with the GTSRB: CactusNet can achieve the same recognition rate as the other leading approaches, but with two orders of magnitude lower complexity.

Use Optimized Hardware
The final method to address the data efficiency problem is to use CNN-optimized hardware, such as the Tensilica Vision P6 DSP. This product incorporates advanced VLIW/SIMD support with a high number of ALU operations per processor cycle, as well as offering a flexible memory bus. The Tensilica Vision DSP family also efficiently speeds up pixel processing, using instructions optimized for the automotive industry, with power-consumption levels that significantly reduce the need for hardware accelerators. The DSPs also offer an integrated DMA engine and an optional vector floating-point unit, answering all the requirements for an efficient DSP, from data usage to power consumption.

For CNN to achieve its full potential, a two- to three-order-of-magnitude improvement in power efficiency is needed. This efficiency must come from optimized software algorithms and hardware IP, as shown by Cadence in its Tensilica Vision DSP family. The future will bring new products to bring computer vision closer to reality. Until that day, save the crossword puzzle for when someone else in your carpool is driving, and don’t spill your coffee on the way.