Neural networks have propelled embedded vision to the point where they can be incorporated into low-cost and low-power devices.
Embedded vision is becoming a topic of heated conversation thanks to the emergence of neural networks and their ability to make computer systems learn by example.
Neural networks are a very different kind of processing element from the other processors in today's IP arsenal, in that they are not programmed in the same manner. They do not execute a stream of instructions that defines which data to use and how to manipulate it. Instead, neural networks are trained on a set of example patterns, and in operation they provide a best answer for patterns they have never seen before. In other words, they learn in a manner similar to the human brain and come up with a best guess.
Computer vision has been performed and optimized on supercomputers for decades, but it is a very difficult problem. How can you algorithmically define what a cat looks like and how do you distinguish it from a dog? Simpler still, how do you perform highly reliable handwriting recognition? Over time algorithms were improved, but then researchers started looking toward neural networks and machine learning as an alternative way to approach the problem.
Neural nets attempt to replicate certain aspects of the way in which the brain works. A neuron has a number of inputs, with each input having a certain weight. A non-linear function is performed within the neuron that produces an output that in turn feeds other neurons.
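The neuron model described above can be sketched in a few lines. This is a generic illustration rather than any particular vendor's implementation; the weight values and the choice of tanh as the non-linear function are arbitrary for the example.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, followed by a non-linear
    # activation function (tanh here), producing one output.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(total)

# A neuron with three weighted inputs; in a network, its
# output would in turn feed the inputs of other neurons.
out = neuron([0.5, -1.0, 0.25], weights=[0.8, 0.3, -0.5], bias=0.1)
```

The non-linearity is what lets layers of such neurons represent functions that a plain weighted sum cannot.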
“The basic aspect of the computation is incredibly regular and equally frustrating,” says Drew Wingard, CTO at Sonics. “The ideal hypothetical structure is that every neuron is connected to every other neuron so that every neuron can affect every other through the coupling. This could be one gigantic matrix multiply, but in reality most things are not actually connected and so you really need a sparse matrix. That is where the frustrating aspects come in.”
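Wingard's dense-versus-sparse distinction can be made concrete: a fully connected layer is a dense matrix-vector multiply, while a sparsely connected one stores and computes only the connections that actually exist. A minimal sketch (the matrices are made up for illustration):

```python
def dense_matvec(weights, x):
    # Fully connected: every output touches every input,
    # including the many weights that are zero.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def sparse_matvec(sparse_rows, x):
    # Sparse: each row stores only (input_index, weight)
    # pairs for connections that exist.
    return [sum(w * x[j] for j, w in row) for row in sparse_rows]

x = [1.0, 2.0, 3.0]
dense = [[0.0, 0.5, 0.0],
         [1.0, 0.0, 0.0]]
sparse = [[(1, 0.5)],   # same matrix with the zeros omitted
          [(0, 1.0)]]
```

The sparse form skips the multiplies by zero, which is exactly the irregularity that frustrates hardware built for one gigantic dense matrix multiply.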
Optimization of the network around the target applications can yield dramatic benefits in recognition rates and especially in computational complexity of both training and inference. “Neural networks, especially convolutional neural networks are dominated by 1D, 2D and 3D filter operations that are closely similar to the operations used for implementing filters on DSPs,” explains Chris Rowen, fellow at Cadence.
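The 2D filter operation Rowen refers to can be sketched directly. This is the generic sliding-window multiply-accumulate (cross-correlation, as CNNs typically implement it), with a made-up image and kernel for illustration:

```python
def conv2d(image, kernel):
    # Slide the kernel over the image; at each position,
    # multiply-accumulate the overlapping values.
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1      # "valid" output: no padding
    ow = len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

# A 3x3 kernel over a 4x4 image yields a 2x2 output.
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[0, 1, 0],
          [1, -4, 1],
          [0, 1, 0]]
result = conv2d(image, kernel)
```

The inner loop is nothing but multiply-accumulates, which is why the operation maps so naturally onto DSP filter hardware.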
Unlike conventional programming, a network is first trained and then executed. “Convolutional neural networks have emerged as a leading form of pattern recognizers because they allow for a simple, systematic [but computationally intensive] method for training,” says Rowen. “Training requires extremely heavy DSP-like computation (back propagation of errors with steepest-descent optimization). The recognition function itself (“forward inference”) is still computationally demanding, with some applications requiring tens of tera-ops per image frame, for example, but many neural network applications are comfortably within the capabilities of current high-end DSPs, and new CNN-optimized DSPs are coming. Moreover, a network may be trained once and used billions or trillions of times, so that the high expense of training is amortized across many uses.”
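The train-then-infer split Rowen describes can be sketched on a single neuron; a full CNN repeats the same forward/backward idea layer by layer. This toy uses the logistic-loss error signal (where the back-propagated gradient is simply output minus target) and a made-up training set, so it illustrates the shape of the process rather than any production trainer:

```python
import math

def train_neuron(samples, lr=0.5, epochs=2000):
    # Steepest descent on one sigmoid neuron: forward pass,
    # back-propagate the error, nudge the weights downhill.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = 1.0 / (1.0 + math.exp(-z))   # forward inference
            grad = y - target                # error signal (logistic loss)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

# Hypothetical training set: learn the logical OR of two inputs.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_neuron(data)
```

Note the asymmetry the article describes: the loop above runs thousands of passes, while using the trained `w` and `b` afterward is a single cheap forward evaluation, amortizing the training cost across every subsequent use.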
As an example, off-line training may use a GPU farm and look at 100,000 images of faces and then create a CNN graph that is programmed into the object detection engine. Those two stages are closely coupled. “When training is done, the graph is created to run on a specific engine,” says Mike Thompson, senior manager of product marketing for embedded vision at Synopsys. “There are characteristics of the object detection engine that you want to take advantage of. You want to train it to run on that platform. We can refine the amount of memory required and the number of processing elements needed as we look at the graph and make choices about how they are optimized.”
“The best simple predictor of neural network performance is a multiply-accumulate rate,” says Rowen. “Traditional CPUs therefore are capable of doing neural network computations, but their multiply-accumulates per square mm and per watt are quite low. Current high-end DSPs are 10-100x more efficient than CPUs on this metric, and specialized ultra-parallel DSPs are even better.”
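The multiply-accumulate metric lends itself to back-of-envelope estimates: count the MACs a layer needs, divide the processor's sustained MAC rate by that count, and you have an upper bound on frame rate. All of the numbers below are hypothetical, chosen only to show the arithmetic:

```python
def conv_layer_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    # One multiply-accumulate per kernel tap, per input channel,
    # per output position, per output channel.
    return out_h * out_w * out_ch * in_ch * k_h * k_w

# Hypothetical layer: 224x224 output, 64 output channels,
# 3 input channels, 3x3 kernels.
macs = conv_layer_macs(224, 224, 64, 3, 3, 3)

# At a hypothetical sustained 100 GMAC/s, the best-case
# frame rate for this one layer:
fps = 100e9 / macs
```

A real network stacks dozens of such layers, so the per-frame totals add up quickly toward the tera-op figures the article cites.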
The CNN has been around for quite a while, but recent improvements in algorithms have allowed it to realize much higher accuracy. “Today, F1 scores are in the low 90s and accuracy in the high 90s,” says Thompson. F1 is a measure of detection accuracy that combines precision and recall into a single score, with 1 as the best value and 0 as the worst. “Market growth is fueled by the increasing performance and decreasing cost and power consumption of processors, and by the growing awareness of the value that can be delivered via object detection, tracking, recognition and other vision processing functions.”
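The F1 score is the harmonic mean of precision and recall, which can be computed directly from the detector's hit and miss counts. The counts below are made up to land in the "low 90s" range Thompson mentions:

```python
def f1_score(true_positives, false_positives, false_negatives):
    # Precision: what fraction of the detections were correct.
    precision = true_positives / (true_positives + false_positives)
    # Recall: what fraction of the actual objects were detected.
    recall = true_positives / (true_positives + false_negatives)
    # F1: harmonic mean of the two; 1.0 is best, 0.0 is worst.
    return 2 * precision * recall / (precision + recall)

# Example: 90 correct detections, 8 false alarms, 10 misses.
score = f1_score(90, 8, 10)   # an F1 in the low 90s (as a percentage)
```

Because it is a harmonic mean, F1 punishes a detector that trades one metric away for the other: a system with perfect recall but poor precision still scores badly.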
While vision within the automobile may be the first application that comes to mind, cars may not be the biggest user of face recognition today. “There are some amazing statistics about the fraction of time the major photo sharing sites spend on training neural networks to have them recognize people,” says Wingard. “A lot of the investment dollars these days are coming from image identification in Facebook photos. People think it is worth investing money in this. People are operating under the impression that they can mine the data that is contained within the images. Long term they will want to find a way to make money from it.”
Apple is getting in on the action, as well, with apps that recognize people in the photos you take and make it easier to share them with your friends. A new patent (20150227782), published just a few days ago, relates to this.
Learning on the job
Because of the complexity of the training process, this step is normally performed off-line and the deployed systems would not have the ability to learn. That doesn’t mean the training stops, though. “The network would need to be improved or updated in the field, either because new algorithms become available, or inputs change,” says Wingard. “The nature of the training is that you are not getting a global optimum. You never have the ‘right’ program for this. It is a matter of do you have a ‘good enough’ program that enables you to recognize what you are looking for.”
Interestingly, this creates two very different problems and opportunities. “Since training is usually much more complex than recognition (inference), networks for mass deployment are retrained less frequently,” says Rowen. “Nevertheless, accelerated training is a hot area (some advanced companies are now building ‘training supercomputers’), so frequent retraining is entirely feasible. In addition, we anticipate systems will carry many training sets, and will adaptively choose training sets based on dynamic conditions. Neural network training also lends itself to partial retraining, which can be much faster than full retraining and serves well when the recognition task is evolving incrementally.”
While they may not be able to retrain themselves, that doesn’t mean they cannot improve over time. “The object detection engine is programmable and has a specialized hardware implementation so that we get high efficiency out of it,” says Thompson. “We also have RISC CPUs attached to those. This is a fully programmable environment and provides a significant compute capability that helps with TLD (tracking, learning and detection). We can run things like OpenTLD on the RISC processors.” OpenTLD is an algorithm and open-source implementation for tracking unknown objects in unconstrained video streams that does not make use of training data.
We all know that humans make mistakes and fail to recognize someone, or call someone by the wrong name, but what about the implications of that in computer vision? “It is certainly true that neural networks are fundamentally heuristic, with no absolute measure of correctness,” says Rowen. “After all, humans don’t give absolute answers either. They make informed estimates and guesses. So the quality of a training set is measured by the quality of results when the trained network is tested with an independent test set. The training set not only needs to be large, but also diverse in a way that reflects the diversity of real-world inputs.”
But the results are at least repeatable. “The neural net does not use fuzzy logic and it is a precise algorithm that is being run,” points out Wingard. “You can argue about the amount of precision you need in the results. This is no different than considering an algorithm and the precision of the result. The result is predictable given the input. That does make verification easier, but the definition of success is not precise. So how much verification is required? Do you verify that they did the math correctly? Some people would argue that that is sufficient. The algorithms need to be able to change. There is no system level definition against which to validate.”
This would seem to create problems when integrated into an automobile collision avoidance system. “It is important to recognize that it is difficult to make these systems 100% reliable,” says Jeff Bier, president of the Embedded Vision Alliance. “The most important way to gain confidence is to test them extensively. It is a meaningful challenge but it is surmountable. Mercedes has been shipping high-end cars with automatic emergency braking systems for several years. Think about brand value. How much is the brand worth? What would happen if they brake for the wrong reason at highway speeds and cause an accident? That would be a disaster. Mercedes is not being reckless. It’s being very calculating. They have come to the conclusion that the technology is sufficiently reliable to be deployed with excellent results.”
Reliability also can be increased by using multiple types of sensors. A vision system in an automotive safety application may use a camera in conjunction with radar. The radar is less susceptible to being fooled by tricks of the light, and it is more reliable at determining whether there is a solid object in your path and how far it is from the sensor. However, it is less helpful in discriminating what that object is. The two together can potentially be more reliable than either alone.
Will there be an “ARM” for neural engines?
There are many companies developing IP for implementation of neural networks today. Will there be a hierarchy of competitors as there is with CPU cores, with ARM far and away the leader? “Too early to say,” says Bier. “It is sufficiently different from the way that people have been doing things. There is a good potential to add value at every layer, at silicon by having the right kind of processor architecture that can do the job efficiently, or at the top of the stack by collecting and labeling the data for training—and to be able to do the training efficiently so that it takes a day rather than a week to do a training cycle and various layers in-between. We are only seeing the very early phase of this right now, but there are lots of opportunities and companies starting to step up. This is a sufficiently different technology that it merits its own ecosystem and that is beginning to form.”
“We expect to see network structure grow in sophistication and specialization, as better tools and methods emerge for optimized network structure discovery,” says Rowen. “I do not think it will necessarily be dominated by a few providers, but the environments for neural network development (network optimization, network training and mapping to target platforms) might become an important design segment.”
What does an engineer need to know? “For most systems, the neural network is just a function,” says Wingard. “It may be standalone for some time, but it is destined for integration where it will be a big black box in the SoC. Nothing else goes away. Just a new subsystem that has some ports and needs certain kinds of access. There are only a small number of people on a team who will have to understand anything about it.”
Has the time for neural networks finally come? Yes, says Wingard. “So many crazy technology ideas just die, but here is one that has been on the back burner for many years with some highly visible attempts in academia and research labs, and this one has actually made it out. It looks as if it will become highly valuable.”