Adoption of this machine learning approach is growing for image recognition; other applications will require power and performance improvements.
While the term may not be immediately recognizable, convolutional neural networks (CNNs) are already part of our daily lives—and they are expected to become even more significant in the near future.
Convolutional neural networks are a form of machine learning modeled on the way the brain’s visual cortex distinguishes one object from another. That helps explain why the most common use today is image recognition, which is where this market is gaining real traction. But the technology has many potential uses far beyond image recognition as power consumption comes down and performance improves.
“The work that happens at a place like Facebook or Yahoo or Google to try to find you in all those pictures people are uploading, that’s largely done with neural networks already,” observed Drew Wingard, CTO of Sonics, noting that driver assistance uses similar technology. “Recognizing the person in the crosswalk or the lamppost you’re about to hit – that’s the same kind of stuff. Computer vision in general is a huge area for these things, and that’s where some of the most impressive results have been shown, because it was an area where people had already done a tremendous amount of work. So much effort had gone into conventional computer vision algorithms that when people started applying CNNs to those problems, they got so much better results with so much less effort that it changed the view of CNNs. Suddenly they looked like a practical way to address some of these questions, when previously they never had. They were always a solution in search of a problem.”
So what else can be done with this technology? Mike Thompson, senior manager of product marketing for DesignWare ARC Processors at Synopsys, said the company is looking at how to drive CNN technology into broader usage in the market. The key hurdles are significantly lowering power consumption while maintaining the target performance level. That applies to vision, especially in the automotive space, as well as to surveillance, general object detection, and augmented reality.
For engineering teams trying to choose the right processor for intense computational activities, maintaining the functionality needed without sacrificing performance or power is the name of the game. This is where CNNs shine.
Steve Roddy, senior product line group director of the IP Group at Cadence, has seen CNNs evolve and explode quite rapidly over the last year or so — and they appear to have significant staying power.
Pierre Paulin, director of R&D for embedded vision subsystems at Synopsys, agreed that CNNs are on a sharp upward trajectory. He said half of the customers he spoke with two years ago were asking about CNNs, but now that number is closer to 95%. “In the automotive space it was relatively new, but some key companies have invested in this space and made products with it, and we are seeing this really happening in the automotive space. Another key aspect of why CNNs are so popular is that you can train them for anything. In the past you had to do special-purpose feature detection for pedestrians or for cars, and everyone developed their own feature detection and that was custom.”
What is a CNN?
Roddy defines CNN as ‘brute force guessing in a structured fashion.’ “People have long said if you had 1 million monkeys, 1 million typewriters, and 1 million years, you could produce the works of Shakespeare. It’s kind of like that.”
CNNs are not new. The first neural network computing products came out 30 years ago with the transputer and several other systolic-array designs, Roddy noted. “People were saying we would be able to replicate the human brain in just a few years. Of course, it didn’t happen. It faded, and in some sense it’s coming back. The reality is you now have such massive compute horsepower available.”
With CNNs there is a notion of training and deployment, which are two very different steps, he explained. “The deployment is sort of obvious, such as when you have a particular network, a series of computations, all essentially filtering and decimating data. For example, you take a giant 30-megapixel image and destroy all the data and get down to an answer of one or zero, one being, ‘Yes, I recognize it, and it is the face of a cat,’ or ‘No, it’s not.’ Computationally it’s all heavy-duty DSP. It’s very compute-bound DSP, so it’s right up our alley in terms of an explosion—two orders of magnitude more compute. People need to do it in an energy-efficient manner. But what really enables it is you now have Amazon Web Services with ginormous million-processor data centers.”
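Roddy’s filter-and-decimate description can be sketched in a few lines. This is a hypothetical toy, not a trained network: the image and kernel here are random stand-ins, and a real CNN would apply many learned filters per layer.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation), the core CNN filtering step."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def pool2x2(x):
    """2x2 max pooling: each stage decimates the data by 4x."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))   # stand-in for a camera frame
kernel = rng.standard_normal((3, 3))    # stand-in for one learned filter

x = image
for _ in range(4):                      # repeated filter-and-decimate stages
    x = np.maximum(conv2d(x, kernel), 0.0)   # convolution + ReLU
    x = pool2x2(x)                           # throw most of the data away

score = float(x.mean())                 # collapse what survives to one number
answer = 1 if score > 0.0 else 0        # the final one-or-zero decision
```

Each pass destroys roughly three quarters of the data, which is why a huge image can be reduced to a single yes/no answer.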
The basic model that a CNN is following is that of the nervous system.
“It has two interesting characteristics of being massively parallel and highly connected, but with a twist that most of the connections aren’t used,” Wingard explained. “There is a limit as to how many might be used, but the learning algorithm has established a set of weights, and the number of synapses that might be attached to a given neuron is essentially unlimited. But the weight that you give to each signal on the synapse is highly variable, and as soon as it gets too close to zero, the synapse doesn’t change the behavior of the neuron that much. That’s what makes it so interesting. If you compare that with other massively parallel things — matrix multiplication and such, where everything is going to be multiplied against everything else — you can build arrays that are efficiently used. But in a CNN context it is not really that way, because so many of the coefficients are zero. Building enough communication hardware to do all of that not-useful communication is not very good. That makes it interesting from an architecture perspective to try to determine the best way, the best hardware architecture, to map this concept onto.”
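Wingard’s point about near-zero weights is easy to demonstrate numerically. In this hypothetical sketch, most of a single neuron’s synapse weights are made tiny; skipping them barely changes the neuron’s output while eliminating most of the multiply-accumulate work:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical trained weights for one neuron: most synapses end up near zero.
weights = rng.standard_normal(1000)
weights[rng.random(1000) < 0.9] *= 1e-4   # ~90% of connections barely matter

inputs = rng.standard_normal(1000)

# Dense evaluation: every synapse contributes a multiply-accumulate.
dense_out = float(weights @ inputs)

# Pruned evaluation: drop synapses whose weight is near zero, skip their work.
mask = np.abs(weights) > 1e-2
sparse_out = float(weights[mask] @ inputs[mask])

kept = int(mask.sum())                    # only ~10% of the work remains
error = abs(dense_out - sparse_out)       # output is almost unchanged
```

This is why a generic all-to-all multiply array wastes effort on a CNN: most of that communication carries coefficients that do not matter.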
Beyond raw efficiency, CNNs also lower the barrier to building complex recognition systems in the first place.
“The reason CNNs are taking off is because they effectively say you don’t need to write programs in order to do complex pattern recognition,” Roddy said. “The challenge is in high-end pattern analytics recognition—whether it’s searching for that one particular face on all the Brussels subway cameras, or whether it’s validating your face as you walk up to your front door and it automatically opens for you. Or it can involve the four cameras on your car speeding down the highway at 70 miles an hour, taking high-resolution video at 60 frames per second and trying to figure out where the other cars are, where the lanes are, and what the speed signs say. That’s huge computation, but there aren’t that many people who have both image science backgrounds and really good embedded programming skills. In order to solve something like ADAS in an automobile — I want to be blazing down the highway, recognize traffic signs, and do it in an efficient manner — you have to take the universe of people that have an image science background. That’s one part of the Venn diagram. The other part would be the universe of people who have good embedded GPU or DSP coding skills. You are looking for the overlap of those two. It isn’t very big.”
He noted that the training side doesn’t necessarily involve writing any code. You may take a known set of labeled data – 1 million images of street signs, 1 million pictures of cats versus 1 million pictures of dogs, 1 million mammograms labeled as cancerous or not. “Essentially you fire up one of these data centers, and these training programs fire up lots of candidate networks and literally guess for hours and hours at a time. You then realize this randomly generated set of computations, which is 100 times more compute-intensive than anything anyone would have written by hand, happens to do a better job of recognizing cancer in mammograms than the best 25-year-trained radiologist. That essentially bypasses the whole bottleneck of not having enough programmers with overlapping skill sets between the domain (imaging, or stock trading patterns) and computing. You give it the data, tell it the outcome, and tell it to find the pattern. You’re freeing the owner of the data from the need to know how to program and code it. For the things that drive huge volume — cell phones that can see around them, cars that can see around them — the amount of horsepower needed to make these things run is just enormous.”
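The train-by-example idea can be illustrated with the simplest possible learner. This sketch substitutes logistic regression on synthetic two-class data for a full CNN and data center; the point is that nothing in the code is specific to the classes being separated — only data and labels go in:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for labeled data: two clouds, labels 1 ("cat"), 0 ("dog").
n = 500
pos = rng.standard_normal((n, 2)) + 2.0
neg = rng.standard_normal((n, 2)) - 2.0
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Training: start from random weights and let gradient descent do the guessing.
# There is no cat-specific or dog-specific code anywhere.
w = rng.standard_normal(2)
b = 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)          # gradient of cross-entropy loss
    grad_b = float(np.mean(p - y))
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

accuracy = float(np.mean((p > 0.5) == (y == 1)))
```

A CNN scales the same recipe up enormously: far more parameters, far more data, and a data center’s worth of guessing.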
This could also be used to analyze everything from stock trading patterns to big data analysis for selling cars to automatic translation with Google Glass. It just so happens that the first and most obvious place to deploy it is visual recognition.
The specs coming out of the auto industry, for example, point to computation that tends to be eight-bit or even six-bit, because data is being destroyed all the time.
“You’re taking high-resolution images, and the first thing you do is start throwing data away fast and furiously to be able to reduce it down to a quick answer: Is grandma on the sidewalk and not in the crosswalk?” said Roddy. “Yes or no? In the foreseeable future [meaning two to four years] the appetite for computation to implement these CNNs is almost limitless. It’s like a gold rush for GPU people or DSP people or even CPU people to be able to throw more horsepower at a problem because suddenly, for so many more problems, a solution is reachable. You could actually think of having five cameras that are constantly watching around the car looking for threats.”
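That low-precision arithmetic can be sketched with a simple linear 8-bit quantization of hypothetical trained weights. The per-tensor scaling scheme here is one common choice for illustration, not any particular vendor’s method:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 32-bit float weights from a trained layer.
w = rng.standard_normal(4096).astype(np.float32)

# Linear 8-bit quantization: map the float range onto signed int8.
scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)

# Dequantize to measure how much information was thrown away.
w_hat = q.astype(np.float32) * scale
max_err = float(np.abs(w - w_hat).max())   # bounded by half a quantization step
```

Each weight now occupies one byte instead of four, and the hardware can use cheap 8-bit multipliers; the rounding error stays below half a quantization step.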
What this means from a power perspective is that even the latest and greatest GPUs run at approximately 270 watts, so running an ADAS application in a hybrid car, for example, requires specialized hardware. In that use case, 270 watts is not acceptable. “The race is on with the DSP vendors and GPU vendors and others to figure out how we get 270 watts down to 2.7 watts, because a couple of watts is easily deployable,” he said.
As researchers learn more about how CNNs actually work, algorithms constantly evolve to cut the number of cycles in half while staying within 1% of the previous best accuracy. Breakthroughs are reported quarterly, sometimes monthly.
Wingard pointed out that because the problem is massively parallel, there are lots of ways of throwing compute power at it. “You need to be careful, as always, about the ratio between computation and communication. It lends itself relatively well to local computation and relatively controlled-distance communication, so that’s good from a power perspective. But you’re talking about doing lots and lots of operations. The reason it’s fast is because they can be done in parallel, but any way you slice it you’re doing a lot of work, so you have to worry about it more. That means the architecture you pick matters.”
As far as bringing the power down for these types of applications given the processing needed, Synopsys’ Thompson said, “If you look at doing this with a GPU, for example, you’re talking about watts of power, and for a lot of applications that’s not tolerable. So we have built a piece of IP with a dedicated CNN engine that is fully programmable. The beauty of that is it allows us to implement CNN with a structure that is very close in power consumption to straight hardware, but it’s fully programmable and can be used for a wider range of applications.”
For processors, so much comes down to architecture, and CNNs are no different. Wingard said CNNs are viewed from a number of vantage points. “There’s the creation of the network, which is the training process, where you’re trying to figure out the coefficients associated with all of the synapses. At that point you don’t know how many of the connections are being used. So the architecture that you may want to use there probably looks different from what you use once the neural network has been trained, and you’re simply trying to run a lot of images past the trained network.”
He said the dominant approach today for post-training deployment is to use arrays of GPUs. “For good and valid reasons, GPUs have pretty strong internal communication paths, and obviously they have gone to pretty massively parallel units that are good at doing arithmetic. So it is a pretty good match once the network has been trained.”
The training process is where there are a lot of opportunities for new architectures. There is an interesting question about whether or not the training will happen often enough to make that a very critical stage, Wingard added. “Right now, people use conventional computers for doing most of that work and it takes a lot longer than if they had some specialized architecture, but they don’t do it as often as they might need to. Now we get into real application questions, which are how often do we need to retrain, what are the circumstances that cause us to retrain, and what are the brain models. We’d like to think that our brains are always learning, always training.”
That learning with CNNs is happening now and is ongoing, so a solution built today may be outmoded within a year. This is why vendors are taking the programmable approach, anticipating even more focus on this area in the future.