Embedded Vision Alliance’s founder and president of BDTI talks about the creation of the alliance and the emergence of neural networks.
Jeff Bier, founder and president of BDTI, sat down with Semiconductor Engineering to discuss the creation of the Embedded Vision Alliance and the proliferation of neural network technology into embedded systems.
SE: Why was the Embedded Vision Alliance formed?
Bier: About 5 years ago, computer vision was on the verge of becoming a world-changing technology. It was becoming possible, for the first time, to implement computer vision in systems that were highly constrained by cost, size and power consumption. There is so much powerful capability that can be delivered through computer vision, be it user interface or security, safety or quality. We have known this for a long time, but for most of that time, it was not practical to implement in everyday products. It was too expensive computationally. But the industry crossed a threshold where it became practical to put sophisticated computer vision into almost anything. We see this today in a wide range of consumer products at the right price points, including battery-powered products.
SE: What do you hope to achieve with it?
Bier: A characteristic of this kind of transition is that product developers, who would for the first time be able to use vision technology, were not aware of it. They still had the view that it was expensive and out of their reach. First, 10,000 product developers needed to be made aware that it is within reach. Second, those developers needed to understand how the technology works, what it is and isn’t capable of, the important tradeoffs, the suppliers, which techniques are known to be effective, and so on.
BDTI started an industry association, which became the Embedded Vision Alliance, to address those challenges—to make product developers aware that vision is a technology that can be incorporated into almost any product and to provide the practical knowledge and skills they need to realize their ideas. At the same time, we provide a natural opportunity for the member companies of the alliance, who are suppliers of enabling technology and services like processors and sensors, algorithms and development tools, to make themselves visible to the product developers.
SE: How important are neural networks to the future of embedded vision?
Bier: Neural nets (NN) are a technology that is emerging very fast. The vast majority of deployed vision systems to date do not use neural networks, but it is also safe to say that the majority of vision system developers today are at least thinking about using them.
SE: What are the biggest challenges with computer vision?
Bier: Contrast computer vision with wireless communications. In wireless communications we understand the physics as to how signals propagate through space and reflect off of surfaces. This means that we have good mathematical models for radio communications. As a consequence, if you provide a roomful of communications designers with a problem, they will tend to converge on the same solution or a small set of solutions—what frequency bands to use, which modulation scheme, what kind of antenna, how much transmit power. The math guides them towards a common solution.
With computer vision, we are trying to emulate human visual perception, and we really do not have a good mathematical framework to understand how human perception works. How do humans extract meaning from visual input? As a consequence, there are no algorithms that would cause people to converge, and in fact the opposite is more likely. There is an incredible variety of algorithms because developers do not have a framework to guide them toward a common type of solution. These are also really hard problems. It is difficult to figure out how to robustly extract the desired meaning from visual inputs. There are a variety of complications: part of the object may be covered by another object, the object may be backlit, the lighting may be poor, there may be glare, the object may be at a strange orientation, or it may be a deformable object, such as a person, that can appear in an infinite variety of poses. When you factor all of these together, it is very difficult.
SE: How are neural net solutions different from earlier embedded vision solutions?
Bier: A car driving down the street at 30 miles per hour sees many objects in its field of view. You want to know, with absolute certainty, if there are any pedestrians in the area where the car is going to be in the next 30 seconds. Something may be tried in the lab using limited test data. It works in the lab, and then it is taken into the field for trials and inevitably it fails. Developers will capture the images when it failed and go back into the lab and diagnose why it failed and then tweak the algorithms. When they do this, the algorithm almost always gets more complex. They may find the approach they took does not work in low-light conditions, so they will put in a mode selector that will determine the lighting condition and use a different algorithm for each. So you wind up with very kludgy and complex algorithms that are constantly being tweaked and refined because you have an infinite variety of inputs. At the same time, you want very reliable output.
The promise of the NN approach is that instead of changing the algorithms, you create or select algorithms that inherently enable the machine to learn and improve over time by being shown lots of examples. You don’t improve the algorithm, although there is potential there, as well. You improve its teaching by showing it more images, both positive and negative examples. Here is a scene where there is or isn’t a pedestrian. The algorithm has a learning capacity built in. It becomes a two-phase thing where you have a learning or training phase that consumes huge amounts of compute power but is typically an off-line process, and then there is the deployment phase where the trained algorithm executes. It can, over time, be retrained with more examples to improve its accuracy. The reason why people are so excited about NN is that they provide an attractive way to address the constant tweaking that has characterized vision applications and algorithms with a more unified framework. Instead of tweaking, we train the algorithm.
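The two-phase idea described above can be illustrated with a minimal sketch. It is not taken from the interview: it assumes a single artificial neuron (logistic regression) in place of a real neural network, and it uses two made-up numeric features per scene in place of images. The function names and the synthetic data are hypothetical. The point is only that accuracy is improved by showing the same algorithm more labeled examples, not by rewriting it.

```python
import math
import random

def predict(weights, features):
    """Deployment phase: run the trained model on one input."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    z = max(-60.0, min(60.0, z))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))  # estimated probability of "pedestrian"

def train(examples, weights=None, epochs=200, lr=0.5):
    """Training phase (offline): adjust weights from labeled examples.

    Passing existing weights continues training, i.e. retraining
    the same model with additional examples.
    """
    n = len(examples[0][0])
    weights = list(weights) if weights else [0.0] * (n + 1)
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, features) - label
            weights[0] -= lr * error
            for i, x in enumerate(features):
                weights[i + 1] -= lr * error * x
    return weights

def make_examples(count, rng):
    """Synthetic stand-in for labeled images (assumption, not real data)."""
    examples = []
    for _ in range(count):
        label = rng.randint(0, 1)  # 1 = pedestrian present, 0 = absent
        center = 0.7 if label else 0.3
        features = [rng.gauss(center, 0.1), rng.gauss(center, 0.1)]
        examples.append((features, label))
    return examples

def accuracy(weights, examples):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    hits = sum((predict(weights, f) >= 0.5) == bool(l) for f, l in examples)
    return hits / len(examples)

rng = random.Random(0)
held_out = make_examples(200, rng)

# Initial training on a small set of positive and negative examples.
weights = train(make_examples(20, rng))
print("accuracy after initial training:", accuracy(weights, held_out))

# Retraining: same algorithm, more examples.
weights = train(make_examples(200, rng), weights=weights)
print("accuracy after retraining:", accuracy(weights, held_out))
```

In this sketch, `train` plays the role of the compute-heavy offline phase and `predict` is the only part that would run in the deployed product.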