Teaching Computers To See

First of two parts: Vision processing is a specialized design task, requiring dedicated circuitry to achieve low power and latency.


Digital eye control

Vision processing is emerging as a foundation technology for a number of high-growth applications, spurring a wave of intensive research to reduce power, improve performance, and push embedded vision into the mainstream to leverage economies of scale.

What began as a relatively modest development effort has turned into an all-out race for a piece of this market, and for good reason. Markets and Markets estimates it will reach $12.5 billion in 2020, up from $8.08 billion in 2015.

Vision processing is showing up in everything from autonomous cars to robotics, and the underlying technology is fundamental to applications ranging from augmented reality (a combination of virtual reality and computer vision) to security and surveillance video analytics, medical imaging, and advanced photography. Already, vision processing has changed the way images are captured, and the amount of advanced technology packed into smartphones in a relatively short time is truly staggering.

Capturing and displaying great quality images has become vital for consumer electronic devices, and will continue to permeate the embedded and IoT worlds as companies are pushed to develop more interesting and visually capable devices, said Jem Davies, ARM fellow, during a recent ARM TechCon keynote.

After capturing and displaying the images, the next leap is in how the images are interpreted. Computer vision — teaching computers to see — will be revolutionary as far as knocking down communication barriers between the digital world and the real world, he asserted. And while a picture paints a thousand words, it also uses up about a thousand times the memory.

“More data means more storage,” Davies said. “My favorite statistic [to this point] is the ratio between the duration of video uploaded to YouTube per unit time. We are currently running at 500 hours of video every single minute. Even with better compression, a single 4K security camera will fill up a disk drive in about 24 hours, so storing pixel data is a problem. Transmitting it to be processed elsewhere is even worse. We cannot afford the bandwidth or the power. There’s too much surveillance video being created to sit, watch, and interpret it manually, and certainly not reliably. Can we interpret those images on device to extract meaning from them automatically? If we can do that, we can reduce the data size considerably.”
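To put rough numbers on the storage problem Davies describes, the arithmetic can be sketched as follows. The resolution, frame rate, bytes per pixel, and compressed bitrate are illustrative assumptions, not figures from the keynote, and the result shifts considerably with the encoder settings and disk size chosen.

```python
# Back-of-the-envelope storage estimate for one camera.
# All parameters are illustrative assumptions, not figures from the keynote.

SECONDS_PER_DAY = 24 * 60 * 60


def raw_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Uncompressed pixel data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps


def bytes_per_day(bits_per_second):
    """Storage consumed in 24 hours at a constant bitrate."""
    return bits_per_second / 8 * SECONDS_PER_DAY


# Assumed 4K UHD stream: 3840x2160, 3 bytes per pixel, 30 frames per second.
raw_rate = raw_bytes_per_second(3840, 2160, 3, 30)
raw_per_day_tb = raw_rate * SECONDS_PER_DAY / 1e12

# Assumed compressed bitrate of 25 Mbit/s (hypothetical encoder setting).
compressed_per_day_gb = bytes_per_day(25e6) / 1e9

print(f"raw: {raw_rate / 1e6:.0f} MB/s, {raw_per_day_tb:.1f} TB/day")
print(f"compressed: {compressed_per_day_gb:.0f} GB/day")
```

Under these assumptions the raw stream runs at roughly 746 MB/s, or about 64.5 TB per day, and even the compressed stream consumes about 270 GB per day per camera. That is the gap on-device interpretation is meant to close.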

Vision data generally is analyzed based on how the human brain works, which makes it difficult to copy and model on a computer. This is no small task. It wasn’t until 2012, with AlexNet, that computers began to close the gap with humans at recognizing the content of images and identifying them correctly.

The steadiest part of computer vision is feature recognition, and the dominant approach for doing feature recognition right now is with neural networks, noted Drew Wingard, CTO of Sonics. “That team [AlexNet] blew away the 20 years’ worth of results in the ImageNet benchmarking suite in 2012 by using a neural network. They made what looked like a step function in improvement in recognition, and everyone that’s won the contest every year since has been using neural networks. So clearly there’s something there that has some significant advantages for the goal, which is accuracy.”

This leads down the winding path of brain-inspired, or neuromorphic, computing, and research is underway to figure out how close to the brain is best.

“Generally speaking, the mathematical operations that are being done take advantage of massive parallelism, lower precision arithmetic, and high connectivity,” said Wingard. “Those are the three key aspects of those architectures. They have the additional crazy benefit of not needing to be traditionally programmed, which is the scary and exciting part. It makes it almost impossible to predict how large a system you need to do a given job. That’s kind of hard, right? You end up running these wild simulations, trying to figure out how many nodes are needed. And if a third of the nodes are taken away, how much does the recognition degrade?”
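One of the three aspects Wingard lists, lower-precision arithmetic, can be illustrated with a minimal sketch. The symmetric 8-bit quantization scheme and the sample values here are assumptions chosen for illustration, not a description of any particular vendor's hardware.

```python
# Sketch of lower-precision arithmetic: symmetric 8-bit quantization.
# The scheme and the sample values are illustrative assumptions.


def quantize(values, num_bits=8):
    """Map floats to signed integers that share one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def dot(a, b):
    """Multiply-accumulate over two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


weights = [0.42, -0.17, 0.88, -0.63]
inputs = [0.5, 0.25, -1.0, 0.75]

q_w, w_scale = quantize(weights)
q_x, x_scale = quantize(inputs)

exact = dot(weights, inputs)
# Integer multiply-accumulate, then a single rescale back to floating point.
approx = dot(q_w, q_x) * w_scale * x_scale

print(exact, approx)
```

The integer result tracks the floating-point one closely, while each multiply operates on 8-bit values. Narrow multipliers of that kind are cheap enough to replicate many times over in a parallel array.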

The learning process is something that’s being hotly debated, as well, including such things as the number of layers of convolution and the functions of the layers. “But what comes up over and over again is energy benefit and performance benefit by almost inverting the design, so that the patterns being recognized are stored close to the units doing the computation. If you’re not careful, you end up with these massive parallel processing arrays being starved for data because they can’t get enough memory bandwidth. It’s always exciting when people talk about rethinking the balance and the architecture of how you think about storage and processing. You can’t think about those two without thinking about the I/O part, as well, because they are talking to each other and they are highly connected. At the same time, they are both massively and sparsely connected.”
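The starvation problem can be framed with a roofline-style estimate, a common way of reasoning about compute versus memory limits. It is offered here as an illustration rather than as Wingard's own method. The peak throughput, bandwidth, and operations-per-byte figures are hypothetical.

```python
# Roofline-style estimate of whether a compute array is starved for data.
# Peak throughput and bandwidth figures are hypothetical.


def attainable_ops_per_second(peak_ops, bandwidth_bytes, ops_per_byte):
    """Throughput is capped by compute or by memory traffic, whichever is lower."""
    return min(peak_ops, bandwidth_bytes * ops_per_byte)


PEAK_OPS = 1e12   # assumed 1 tera-op/s parallel array
BANDWIDTH = 10e9  # assumed 10 GB/s to external memory

# Fetching every operand from external memory: few operations per byte moved.
streamed = attainable_ops_per_second(PEAK_OPS, BANDWIDTH, ops_per_byte=2)

# Keeping patterns (weights) next to the compute units: heavy data reuse.
local = attainable_ops_per_second(PEAK_OPS, BANDWIDTH, ops_per_byte=200)

print(f"streamed: {streamed / PEAK_OPS:.0%} of peak")
print(f"local reuse: {local / PEAK_OPS:.0%} of peak")
```

With these numbers the array reaches only 2% of its peak when operands stream from external memory, and its full peak once data reuse is high. That is the case for storing patterns close to the units doing the computation.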

For most of these networks, the ideal model is fully connected. “When you look at nature you can imagine that everything is connected to everything,” he added. “But if you actually look at how much attention a given neuron pays to all of its inputs, it pays attention to very few of them. That’s where the sparseness comes in. You can safely ignore most of the connections. So should you build them? That’s the wild part because it’s in the learning process that you find out which connections are needed. So unless you want to build a chip that’s completely specialized to a set of patterns — which most people don’t want to do — you end up wanting a system that abstractly looks fully connected but doesn’t need, and therefore doesn’t build, enough resources to be fully connected.”
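The sparseness Wingard describes can be shown with a small sketch of a single neuron. Magnitude pruning, which keeps only the largest weights, is one possible way to drop connections and is an assumption here. The weights and inputs are made up.

```python
# Sketch of sparse connectivity: keep only the strongest inputs of a neuron.
# Magnitude pruning and the sample numbers are illustrative assumptions.


def prune_to_top_k(weights, k):
    """Zero all but the k largest-magnitude weights."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def neuron_output(weights, inputs):
    """Weighted sum of inputs (activation function omitted for brevity)."""
    return sum(w * x for w, x in zip(weights, inputs))


# A hypothetical learned neuron: nominally connected to 8 inputs,
# but only a few of the weights are large.
weights = [0.91, -0.02, 0.01, -0.76, 0.03, 0.00, 0.58, -0.01]
inputs = [1.0, 0.5, -0.5, 1.0, 0.25, 1.0, -1.0, 0.75]

sparse = prune_to_top_k(weights, k=3)
dense_out = neuron_output(weights, inputs)
sparse_out = neuron_output(sparse, inputs)

print(sparse)
print(dense_out, sparse_out)
```

Dropping five of the eight connections barely moves the output in this example. The catch, as Wingard notes, is that which three connections matter is only known after learning, so the hardware cannot simply hard-wire them in advance.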

Flexibility needed
Adding to the complexity is the fact that vision processing requires a massive number of computations.

“You take the jump from images to video to being able to intelligently recognize video,” said Randy Allen, director of advanced research at Mentor Graphics. “Each of those is a jump in enormous amounts of compute power. You’ve got lots of data, so lots of processing. By definition, at least the way things are, it requires parallel processing. And unlike video, unlike imaging and things like that, there’s not an algorithm out there that you can find that turns around and says yes or no with 100% accuracy, that this is a pedestrian, this is a bicycle, or something like that. Humans can’t do that with 100% accuracy, so it’s one where you need flexibility. You’re fumbling around figuring out the algorithm you want to use, so it needs flexibility there. It’s not something you’re going to go design in an ASIC the first time around because you’re going to be figuring it out.”

This requires lots of computation power, but it also requires parallelism because of the need for conserving power. “That limits you in a lot of things,” said Allen. “What you end up going with most of the time is a GPU with an FPGA assist because that provides the parallelism. There are tradeoffs between the two. You usually get more power at a better price, and it’s definitely easier to program on the GPU side. The advantage on the FPGA is you get deterministic types of things.”

To make matters worse, computer vision hardware also needs to be compatible with different industries. “We have automotive challenges where you might have the power, but obviously these processors come into play in robotics and drones, and there are requirements in terms of power and how much they can consume,” said Amin Kashi, director of ADAS and automated driving at Mentor Graphics. “The challenge is to have something that fits in all these different industries, and that makes it more difficult. On the automotive side, we are dealing with a lot of data, we have multiple streams of video, and obviously we will have other types of information from sensors that has to be fused with the vision information. That brings requirements on the computing and power side that are stricter, which makes it more difficult.”

Layers of complexity
This is hard enough to figure out, but in the automotive world the data from cameras that is the basis for computer vision often is combined with data from other functions.

“The more advanced systems will do one of two things, or sometimes both,” said Kurt Shuler, vice president of marketing at Arteris. “First, they will phone home to a server somewhere, and they’ll get an updated model. They’ll get updated information based on what other cars are seeing, so as you cross into Canada it knows these are Canadian street signs, not American street signs. But even if you didn’t already have that in the database, your car would see it and think it was weird, and then blast it back out to other cars. In fact, there’s even been talk of having some kind of national standard so every car could share that information. And second, it’s not always just about computer vision, so a lot of these machines have something similar to what was in military aircraft (i.e., sensor fusion). You’re getting inputs on what stuff is and probabilistic guesses of what stuff is based on the vision computing. You also have a LIDAR. For longer-range stuff there is radar, for shorter range stuff there is sonar, and navigation systems, so when this car is going down a street it is creating a 3D model of all the stuff in front of it. But in some of these systems, it’s not vision only, and that’s one of the things that adds complexity. But it can be necessary. If you’re in a snowstorm, vision by itself isn’t going to see through snow or fog like other sensors can.”
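The idea of combining probabilistic guesses from several sensors can be sketched with a naive Bayes fusion rule. This is one textbook approach, chosen here for illustration. It assumes the sensors are conditionally independent, and the sensor names and probabilities are hypothetical rather than taken from any production system.

```python
# Sketch of probabilistic sensor fusion (naive Bayes, independence assumed).
# Sensor names and probabilities are hypothetical.


def fuse(prior, sensor_probs):
    """Combine per-sensor probabilities that an object is present.

    Each sensor probability is treated as a posterior computed from the same
    prior, and the sensors are assumed to be conditionally independent.
    """
    prior_odds = prior / (1 - prior)
    odds = prior_odds
    for p in sensor_probs:
        # Multiply in the likelihood ratio implied by this sensor.
        odds *= (p / (1 - p)) / prior_odds
    return odds / (1 + odds)


# Snowstorm scenario: the camera is uninformative, the other sensors are not.
readings = {"camera": 0.5, "radar": 0.9, "lidar": 0.6}
fused = fuse(prior=0.5, sensor_probs=readings.values())

print(f"{fused:.3f}")
```

A camera reading of 0.5 contributes nothing, yet the fused estimate still ends up well above any single sensor's confidence. That illustrates why vision-only systems struggle in the conditions Shuler describes.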

Technically, what’s needed for a vision processor is twofold, explained Gordon Cooper, product marketing manager for embedded vision processors at Synopsys. “First, heterogeneous processing is something that comes up a lot with embedded vision processors because you need a lot of multiplies, which is very DSP-like. But you also need a lot of parallel multiplies, which is more of a vector DSP, and there’s not that many out there. Analog Devices and TI DSPs have been optimized for power for audio and speech. It’s not really optimized for intense parallel processing, and maybe 10 or 20 things at the same time. You’re doing these massive computations for vision because it’s not an audio stream now. It’s a 2D image, which may be running at 30 frames per second. It’s a lot of data to crunch through, and a lot of parallelism. There’s the intense math side, and then of course you still have that computational control side where you have to figure out what to do with this data. I’m tracking an object across the screen. What am I going to do about that? It’s both pieces: heterogeneous processing includes control and a lot of vector DSP-type of math.”
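The gap between audio-class and vision-class workloads can be made concrete by counting multiply-accumulate (MAC) operations. The 30 frames per second comes from Cooper's remarks. The frame size, kernel size, filter count, and audio filter length are illustrative assumptions.

```python
# Rough multiply-accumulate (MAC) count for filtering a video stream.
# Frame size, kernel size and filter count are illustrative assumptions.


def conv_macs_per_second(width, height, kernel, in_channels, filters, fps):
    """MACs per second for one convolution layer at full resolution (stride 1)."""
    macs_per_output_pixel = kernel * kernel * in_channels
    macs_per_frame = width * height * macs_per_output_pixel * filters
    return macs_per_frame * fps


# Assumed 1920x1080 RGB video at 30 frames per second, 3x3 kernels, 16 filters.
video = conv_macs_per_second(1920, 1080, kernel=3, in_channels=3, filters=16, fps=30)

# For contrast: an assumed 64-tap filter on 48 kHz mono audio.
audio = 64 * 48_000

print(f"video layer: {video / 1e9:.1f} GMAC/s")
print(f"audio filter: {audio / 1e6:.1f} MMAC/s")
```

Under these assumptions a single modest convolution layer needs about 26.9 GMAC/s, against roughly 3.1 MMAC/s for the audio filter. That difference of nearly four orders of magnitude is why a DSP tuned for audio and speech falls short.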

There are two sides to this, he noted. “There’s a lot of work being done by the Facebooks of the world to use these vision algorithms and deep learning to try to sort through all these pictures that people take, offer things up, and find out if a picture has a certain mood or certain face in it. That’s all done with servers, and probably done by GPUs. Now, a GPU is going to consume a lot of power. It’s going to be a big die. It has the capability, maybe not quite as efficient as a fully customized vision solution, but you get an off-the-shelf, big server rack of these things, and they can crank through stuff. But if you’re going to put this in a toy robot, power is critical, die size is critical, and yet you still need the performance to do all that crunching of the 2D image and the stream of video. To do that effectively, you’ve really got to tailor it for the application.”

From a performance, power and area (PPA) standpoint, one of the reasons embedded vision processors are so good at low power is that they are optimized to leave out everything else a GPU does, since GPUs are really designed for graphics. “You can do it in a GPU, but they are big and power hungry,” said Cooper. “You can do it in an FPGA. Some of these have DSP slices plus their fabric, and those are good maybe for small volume and for prototyping, but they quickly get expensive if you’re going to do anything in high volume, like adding vision to a toy, for example. You can do it in a DSP, but a DSP is more optimized for audio and speech, and does not have the performance you would need for vision. And, of course, a CPU is big and not optimized at all. As such, if you are doing anything at high volume (automotive, or surveillance cameras, or consumer or retail applications), that’s when a dedicated embedded vision processor really makes sense.”

Dennis Crespo, director of marketing, vision and imaging IP at Cadence, agreed: “Architecturally, the basics of why an engineering team would choose to use a vision processor versus a GPU or CPU is for the same or more performance. The design achieves, on average, 5 to 50X power reduction for the same application.

“I could talk for an hour on all the little things we’ve done in our instruction set, in our hardware pipeline, in our parallelism to achieve that goal. Basically, we can do the same type of operation that a GPU can do in a thousand cycles in two cycles. And that’s really because we tuned our instruction set for those applications. We have an instruction set that’s tuned for vision and imaging. As a result, we’re not good at doing regular Android calls, and we’re no good at doing 3D graphics.”

Fundamentally, the first challenge in designing vision processors is understanding how the hardware is connected inside the SoC, and how to leverage that. Then, on the software side, how is the DSP accessed? At what level in the software stack? Is it just at the lower level, where only, say, the manufacturer of that SoC writes its own software and produces functions that run natively on that DSP? Or is it exposed at a higher level, where an outside application developer can use it? That’s yet another consideration in this process, Crespo said.

To be sure, computer vision is complicated. But considering the numerous opportunities to apply the technology to make the world safer and healthier, the efforts seem well-placed.

Part two will examine techniques, EDA tool requirements, and software considerations in the design of computer vision processors.
