Software Framework Requirements For Embedded Vision

Key factors to consider when choosing an embedded vision system.


Deep learning techniques such as convolutional neural networks (CNNs) have significantly increased the accuracy, and therefore the adoption rate, of embedded vision. Starting with AlexNet’s win in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), deep learning has changed the market by drastically reducing the error rates for image classification and detection tasks (Figure 1). Deep learning has also changed how embedded vision is implemented. CNN graphs are not “programmed”; they are “trained” using a software framework and then mapped onto the embedded vision hardware.

Figure 1: ImageNet Large Scale Visual Recognition Challenge results show that deep learning is surpassing human levels of accuracy.

An embedded vision system must offer performance equivalent to that of a GPU system at a fraction of the power and die area. Embedded vision processors are highly optimized heterogeneous systems, which means they have processing units that are optimized for their specific tasks: a scalar unit for control, a vector unit for pixel processing, and a dedicated CNN engine for executing deep learning networks. Because each unit is optimized for its part of the embedded vision workload, the system delivers high performance in a small area and power budget. When evaluating software frameworks for the final vision application, designers should take into consideration framework availability, bit resolution, graph mapping tools, and optimization options for the hardware.

The Basics: Supported Software Frameworks
The first question to ask when considering an embedded vision system is, “Which software frameworks are supported?” Not all software frameworks are supported by embedded vision system tools. The best choices are the most popular, like Caffe and TensorFlow, but they were not originally designed with embedded vision systems in mind. Training of deep neural networks has traditionally targeted CPU and GPU solutions that can easily accept 32-bit floating-point coefficients. The Khronos Group is driving a new common standard, but it has not been fully adopted yet.

Bit Resolution for Area and Performance
An embedded designer has more to consider when using a software framework to train a CNN graph for an embedded vision processor. Designers must pay attention to the bit resolution of the CNN calculations, the hardware optimizations that can be applied during training, and how best to take advantage of new coefficient or feature map pruning and compression techniques.

Based on careful analysis of popular CNN graphs, Synopsys has determined that many CNN calculations on common classification graphs can achieve good accuracy with 8-bit precision. Some graphs, however, require 10- to 12-bit resolution to achieve the same accuracy as the floating-point results achieved on a GPU. The CNN engine for the DesignWare EV6x Embedded Vision Processor (Figure 2) uses highly optimized 12-bit multiplications. Embedded systems must be frugal with memory, because excess memory adds to cost and power consumption. Caffe graphs with 32-bit floating-point output can be mapped to the Synopsys 12-bit CNN architecture with no loss in accuracy. The Synopsys CNN engine also supports 8-bit multiplications for graphs trained at 8-bit resolution.
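The trade-off between bit resolution, accuracy, and memory can be illustrated with a minimal sketch. It assumes simple symmetric fixed-point quantization with one scale factor per tensor and uses synthetic random values in place of trained coefficients; it is not the Synopsys mapping algorithm, and the function names are hypothetical.

```python
import random

def quantize(values, bits):
    """Symmetric fixed-point quantization to signed integers of width `bits`.

    Assumption for illustration: a single scale factor for the whole tensor.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 2047 for 12-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

def max_abs_error(values, bits):
    """Largest difference between a value and its quantized reconstruction."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))

random.seed(0)
# Synthetic stand-in for trained 32-bit floating-point coefficients.
coefficients = [random.gauss(0.0, 0.1) for _ in range(10_000)]

for bits in (8, 12):
    err = max_abs_error(coefficients, bits)
    size = len(coefficients) * bits // 8
    print(f"{bits}-bit: max abs error = {err:.6f}, storage = {size} bytes")
print(f"32-bit float: storage = {len(coefficients) * 4} bytes")
```

In this sketch the worst-case rounding error is about half of one quantization step, so each extra bit roughly halves it while adding to the coefficient storage. Whether a given error level is acceptable depends on the graph and has to be checked against classification accuracy, which this example does not measure.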

Software frameworks are starting to pay more attention to embedded systems, so it will become possible to train graphs for specific bit resolutions.

Figure 2: DesignWare EV6x Vision Processor includes 1 to 4 scalar and vector units as well as a dedicated, tightly integrated CNN engine.

Graph Mapping Tools and Features
Graph mapping tools are critical to converting the output of the software framework to the appropriate bit resolution of your embedded vision processor. Once training is complete, a graph mapping tool converts the coefficients and graphs from the software framework into the format the embedded vision system recognizes for deployment (Figure 3).

Figure 3: Training and Deployment/Inference phases.

Hardware Optimization
Designers may be able to optimize a graph for the target hardware during training. Suppose your hardware is better optimized for certain convolution sizes. If, for example, 3×3 and 5×5 convolutions have better MAC utilization than 4×4 or other configurations, you can benefit by choosing those convolution sizes during training. To gain these benefits, the person doing the training must know which hardware platform has been chosen. This often isn’t the case.
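As a rough illustration of why convolution size can affect MAC utilization, the sketch below assumes a hypothetical CNN engine that natively supports only 1×1, 3×3, and 5×5 windows and zero-pads any other kernel up to the next native size. The native sizes and the padding behavior are assumptions made for this example, not a description of any particular processor.

```python
# Hypothetical: window sizes the MAC array handles without padding.
NATIVE_KERNEL_SIZES = (1, 3, 5)

def padded_kernel_size(k, native_sizes=NATIVE_KERNEL_SIZES):
    """Smallest native window that can hold a k x k kernel."""
    candidates = [n for n in native_sizes if n >= k]
    if not candidates:
        raise ValueError(f"{k}x{k} kernel exceeds the largest native window")
    return min(candidates)

def mac_utilization(k, native_sizes=NATIVE_KERNEL_SIZES):
    """Fraction of MAC operations that do useful work for a k x k convolution."""
    n = padded_kernel_size(k, native_sizes)
    return (k * k) / (n * n)

for k in (3, 4, 5):
    n = padded_kernel_size(k)
    print(f"{k}x{k}: runs as {n}x{n}, utilization = {mac_utilization(k):.0%}")
# 3x3: runs as 3x3, utilization = 100%
# 4x4: runs as 5x5, utilization = 64%
# 5x5: runs as 5x5, utilization = 100%
```

Under these assumptions a 4×4 convolution occupies the same MAC cycles as a 5×5 convolution while using only 16 of the 25 multipliers, which is the kind of information a person training the graph would need from the hardware team.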

As new CNN graph techniques have improved in accuracy, they have also increased in the number of layers and therefore in the number of coefficients needed. The more coefficients required, the more memory storage, memory transfer bandwidth, and MAC operations are required. Again, these impact power and die size. New techniques to reduce the number of coefficients while preserving graph accuracy involve coefficient pruning and compression. Graph pruning zeros out coefficients that are close to zero. Pruning is an iterative process that must be done as part of training because it requires access to the dataset. Pruning can significantly reduce the computations and bandwidth for a CNN graph. For pruning to be effective, the embedded CNN engine must support decompression, which handles the irregular network connections that result from pruning.
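A single pruning pass and a matching compression scheme can be sketched as follows. This is a simplified illustration: the threshold, the sample coefficients, and the zero-run encoding are assumptions chosen for the example, and the retraining iterations that real pruning requires are omitted.

```python
def prune(coefficients, threshold):
    """Zero out coefficients whose magnitude is below `threshold`."""
    return [0.0 if abs(c) < threshold else c for c in coefficients]

def compress(coefficients):
    """Encode as (zero_run, value) pairs plus the original length.

    zero_run is the number of zeros skipped before each nonzero value.
    """
    pairs, run = [], 0
    for c in coefficients:
        if c == 0.0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs, len(coefficients)

def decompress(pairs, length):
    """Rebuild the dense coefficient list from (zero_run, value) pairs."""
    out = []
    for run, value in pairs:
        out.extend([0.0] * run)
        out.append(value)
    out.extend([0.0] * (length - len(out)))  # restore trailing zeros
    return out

# Made-up coefficients for illustration only.
weights = [0.42, -0.01, 0.003, -0.37, 0.0, 0.02, 0.88, -0.004]
pruned = prune(weights, threshold=0.05)
pairs, length = compress(pruned)

print(pruned)  # [0.42, 0.0, 0.0, -0.37, 0.0, 0.0, 0.88, 0.0]
print(pairs)   # [(0, 0.42), (2, -0.37), (2, 0.88)]
print(f"nonzero MACs: {len(pairs)} of {length}")  # nonzero MACs: 3 of 8
assert decompress(pairs, length) == pruned
```

Only the nonzero coefficients need to be stored, transferred, and multiplied, which is where the savings in memory, bandwidth, and MAC operations come from. The decompression step stands in for what the CNN engine would have to do in hardware to locate each surviving coefficient.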

Vision technology is enabling a wide range of applications, such as augmented/virtual reality, automated drone control, and smart surveillance, which offer intelligent and responsive capabilities. Embedded vision implementations built on deep learning software frameworks are being integrated into larger SoCs, and deep learning techniques will continue to be an attractive approach for vision developers in specific markets. To take full advantage of deep learning, software developers will need to examine many aspects of the software frameworks they are considering, including the final hardware deployment.
