Limitations of convolutional neural nets on DSP processors
Automotive automation has been the driver – so to speak – of the next leap of innovation in transportation. Car architectures are being re-engineered to take advantage of incredible leaps in automation, using more powerful processors that handle more data than ever before.
The recent focus on autonomous automobile technology could be due to the ongoing drop in the cost of radar, infrared imagers, sonar, GPS, and other sensors – but it could also be due to the dramatic improvement in the processing power of embedded systems. Without the processing power required to interpret the data, it doesn’t matter how sophisticated or inexpensive imaging technology may be.
Vision, in this context, is the ability of embedded systems to extract meaning from “images” and make split-second decisions based on that content. The efficacy of convolutional neural nets (CNNs) in image recognition has moved the industry forward by leaps and bounds. In terms of microelectronics, DSP development has historically been driven by managing the power requirements of the processors. The same priorities apply when applying embedded CNNs to automotive automation – allowing us to put our feet up and focus on finishing yesterday’s crossword puzzle during our morning commute.
The Efficiency Problem
At the recent Embedded Neural Network Summit held at Cadence in San Jose, it became clear that a major challenge for deep neural networks (DNNs) to be successfully implemented in embedded devices is power management. The current state of the art in CNN implementation runs at about 40 W/TMAC (watts per tera-multiply-accumulate operations per second), and a typical application requires 4 TMAC/s, resulting in a 160 W power requirement. Our current embedded devices’ power budgets simply cannot accommodate this.
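The arithmetic behind that 160 W figure is simple enough to sketch directly (the efficiency and workload numbers are the ones quoted above):

```python
# Back-of-the-envelope power budget from the figures above:
# 40 W per TMAC/s of efficiency, 4 TMAC/s of workload.
watts_per_tmac = 40.0   # W per tera-multiply-accumulate per second
workload_tmac = 4.0     # TMAC/s for a typical vision application

power_required = watts_per_tmac * workload_tmac
print(power_required)   # prints 160.0 -- i.e. 160 W, far beyond an embedded budget
```

For comparison, an automotive ECU typically has only a few watts to spare, which is why the efficiency of the implementation, not just raw throughput, is the binding constraint.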
Four ways to address this problem are to:

1. Optimize the problem itself, reducing the amount of data that must be classified
2. Minimize the number of bits in the representations
3. Optimize the network architecture
4. Use hardware optimized for CNNs
In this two-part article, I’ll flesh out these four methods to conserve power when processing the vast amount of data required for complex vision processing.
Optimize the “Problem” Problem
The first of these methods is to reduce the size of the pixel-segmentation problem. How do you tell a system that an image, containing all kinds of noise, is actually an image of a road with a deer frozen in the headlights – and then expect the system to identify what is “road”, “horizon”, “mailbox”, “fencepost”, “deer”, and “oncoming traffic”? How does the system know that the visual field contains a hazard, and the car must stop immediately? In a single 375 × 1242-pixel image, there are almost 466,000 pixels requiring “classification.” Multiply that image by a frame rate – moving pictures – and you have more data than you know what to do with.
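The scale of the raw classification load is easy to quantify; a short sketch using the frame size above (the 30 fps frame rate is an assumption for illustration):

```python
# Rough per-pixel classification load for the frame size mentioned above.
height, width = 375, 1242     # KITTI-style frame dimensions
fps = 30                      # assumed frame rate for illustration

pixels_per_frame = height * width
pixels_per_second = pixels_per_frame * fps
print(pixels_per_frame)    # prints 465750 -- almost 466,000 per frame
print(pixels_per_second)   # prints 13972500 -- roughly 14 million per second
```

At nearly 14 million pixel classifications per second for a single camera, classifying every pixel individually is clearly untenable on an embedded power budget.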
Of course, the neural net can “prioritize” or segment the important pixel information. Rather than focus on one pixel at a time, a convolutional net takes in patches of pixels, or pools, and passes them through a filter. That filter, or kernel, is also a matrix smaller than the image itself, with the same depth as the input patch. The “job” of the filter is to find patterns in the pixels. This filter adds another layer to the source, and there may be multiple layers to “check” for different features (for example, edges, corners, horizontal/vertical lines, colors, and so forth). Once the feature is identified, the system can make some decisions based on the content.
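The filter pass described above can be sketched in a few lines. This is a minimal illustration, not any particular library’s implementation: a hand-rolled 2D convolution with an assumed 3×3 vertical-edge kernel, valid padding, and stride 1.

```python
import numpy as np

# Minimal sketch of one convolutional filter pass over a grayscale image
# (valid padding, stride 1). Each output value is the dot product of the
# kernel with one patch of pixels -- one multiply-accumulate per weight.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy image: dark pixels on the left, bright pixels on the right.
image = np.array([[0, 0, 9, 9]] * 4, dtype=float)

# A classic vertical-edge kernel: responds strongly where brightness
# changes from left to right, and not at all in flat regions.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

response = convolve2d(image, edge_kernel)
print(response)  # every entry is -27.0: a strong response along the vertical edge
```

Stacking many such kernels – one per feature (edges, corners, colors, and so on) – is what builds up the layers described above, and each kernel application is a burst of the multiply-accumulates that dominate the power budget.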
In research done at Cadence, the team found they could “redefine the ground truth” by segmenting the image for features instead of classifying every pixel, making the problem 22 times smaller.
Figure 3: KITTI Road Segmentation Dataset
Next time, I’ll look at how to conserve power by minimizing the number of bits in the representations, optimizing the network architecture, and using hardware optimized for CNNs.