Using CNNs To Speed Up Systems

Just relying on faster processor clock speeds isn’t sufficient for vision processing, and the idea is spreading to other markets.


Convolutional neural networks (CNNs) are becoming one of the key differentiators in system performance, reversing a decades-old trend that equated speed with processor clock frequencies, the number of transistors, and the instruction set architecture.

Even with today’s smartphones and PCs, it’s difficult for users to differentiate between processors with 6, 8 or 16 cores. But as the amount of data being generated continues to increase—particularly with images, video and other sensory data—performance will be determined less by a central processor than by how quickly that data can be moved between various elements within a system, or even across multiple systems.

“This notion of vision capabilities that deep learning has brought to the forefront of the industry fulfills a need that people wanted a long time ago, and it has opened the door for new capabilities that existing architectures are not designed for,” said Samer Hijazi, senior design engineering architect in the IP group at Cadence. “For the first time there is an application that existing CPUs and processors cannot handle within a reasonable power budget. In some cases they cannot handle it at all. This has opened the door for CNNs. It’s an exciting opportunity for chip designers. There is finally a new demand on the consumer side for enhanced processing capability that existing architectures cannot offer, and this has triggered a flurry of new startups.”

The most important power impact from the CNN is for the ADAS application, said Norman Chang, chief technologist for the semiconductor business unit at ANSYS. “If you look at ADAS applications, there are so many startups and automotive companies pushing forward for Level 3 and Level 4 certification. There are a couple of companies working on the hardware aspect because they require the intelligence for vision recognition, passenger recognition, voice recognition and face recognition to observe the driver, and look at the surrounding environment. They also have to get data from multiple sensors, including LiDAR, radar and others, along with video and audio systems. All of these recognition requirements mean an intelligent system is necessary, and that often is implemented in GPUs.”

Fig. 1: Different types of neural network architectures. Source: The Asimov Institute

System-level performance
While universities have been studying deep learning algorithms for some time, only recently has it been viewed as a commercial necessity. That has resulted in huge advances over the past couple of years, both on the hardware and on the software side.

“Universities are not in the business of deploying products, so power does not matter as much in this context,” said Hijazi.

But power matters a great deal in commercial applications such as autonomous vehicles.

“Power is a critical factor for CNNs because the machines that run them are deployed in places where power is a big consideration,” said Randy Allen, director of advanced research at Mentor, a Siemens Business. “The key part of a CNN is the matrix multiplication. For 30 years, matrix multiplication has been central to huge numbers of problems in the world, so it’s very well studied. And for those 30 years, the key to speeding up matrix multiplication—so you can execute problems of the size you encounter here—has been the speed of the memory. All of the research has been focused on how to isolate blocks of data so you don’t have to do so many memory references back and forth. ‘How do I take one part of the matrix, get just the stuff I need to compute, compute it and keep all the data there, and then throw it out and move on to the next part?’ That’s how you make it fast, and that’s also how you make it lower power, because the power goes into toggling memory.”
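The blocking strategy Allen describes can be sketched in a few lines. The code below is illustrative rather than a tuned implementation; the function name `blocked_matmul` and the tile size of 32 are arbitrary choices for this example, not values drawn from any real library.

```python
import numpy as np

def blocked_matmul(a, b, tile=32):
    """Multiply a (m x k) by b (k x n) one tile at a time.

    Each output tile is computed from small sub-blocks of the inputs,
    so the working set stays cache-resident instead of streaming the
    full matrices from memory for every output element. The tile size
    is an illustrative choice, not a tuned value.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Accumulate one tile's partial product; the data
                # touched in this step fits in a few KB regardless of
                # the overall matrix size.
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return c
```

The result matches an ordinary matrix product; only the memory access pattern changes, which is exactly the knob Allen identifies for both speed and power.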

General-purpose CPUs are not the processor of choice for multiply-accumulate operations.

“CPUs have been focused on faster clock speeds and supporting wider bit resolution, going from 32 bits to 64 bits, and being able to support more instructions, branching and dedicated operations,” Hijazi said. “It’s about doing logic faster. This particular area starts with algorithms in need of a wide array of multipliers and accumulators.”

So far, it’s not clear which platform is best for this. Single instruction, multiple data (SIMD) architectures, discrete and embedded FPGAs, DSPs and GPUs have been battling it out over the past year for dominance in this space. The jury is still out as to which architecture works best where. But the end goal is relatively straightforward. There are significant power and performance tradeoffs associated with CNNs, and there are some fairly obvious ways to separate the issues associated with design efficiency in neural networks.

“Fortunately the metrics are fairly straightforward,” said Chris Rowen, CEO of Cognite Ventures. “It’s not very esoteric or abstract or fuzzy. You actually can benchmark the systems in a fairly straightforward manner. Performance and power and cost are fairly predictable if you know the key parameters of the neural network that you want to execute.”

The design team needs an acceptable performance range, however. “It’s not necessary for them to know exactly what neural network structure they’re going to run,” Rowen said. “In fact, you sort of hope that they’re evolving their network, that they’re figuring out better networks that are some combination of giving them better quality of results or are consuming fewer resources in memory or compute to do it. For this reason, flexibility will be part of it. But you still need to know whether you need 100 billion operations per second or 1 trillion or 10 trillion or 100 trillion operations per second for the class of networks that you think you’re going to do.”
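The sizing exercise Rowen describes comes down to counting multiply-accumulate operations for the class of networks in question. The estimator below is a hypothetical sketch: the layer shapes and the 30 frames/second rate are invented for illustration, not taken from any particular network.

```python
def conv_macs(h, w, c_in, c_out, kh, kw, stride=1):
    """Multiply-accumulate count for one convolutional layer:
    every output pixel needs kh*kw*c_in multiplies per output
    channel."""
    out_h = h // stride
    out_w = w // stride
    return out_h * out_w * c_out * kh * kw * c_in

# Rough budget for a hypothetical small network on a 224x224 RGB
# input (shapes invented for illustration), run at 30 frames/second.
layers = [
    (224, 224, 3, 64, 3, 3, 2),     # stem: 3 -> 64 channels
    (112, 112, 64, 128, 3, 3, 2),   # downsampling stage
    (56, 56, 128, 256, 3, 3, 1),    # heaviest stage
]
macs_per_frame = sum(conv_macs(*layer) for layer in layers)
macs_per_second = macs_per_frame * 30
```

Even this toy network lands around a billion MACs per frame, which is why knowing the order of magnitude—billions versus trillions of operations per second—matters more at the platform-selection stage than the exact network topology.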

The choice may be supporting 5 trillion operations per second, but someone may come along with a better network that requires more computing and suddenly it doesn’t work anymore.

“Knowing what level of computation you are aiming at is important,” said Rowen. “It’s worth noting that the computational capabilities of platforms have been on a very steep rise so that people can dream about a level of performance that’s really extraordinary so that if you decide you need 1 trillion multiplies per second, that’s not a big deal inside of a chip. So in the right kind of platform, having very significant amounts of compute is not a big barrier. But you do need to know what order of magnitude of compute you want to do.”

That applies to the type of engine used to run the neural network, as well. What is the raw compute per watt for the type of data being operated on? Rowen pointed out that there is a big variation between the cost per operation for 32-bit floating point vs. 16-bit floating point, and that both differ from 16-bit fixed point or 8-bit fixed point.

“Almost everything that people are doing in the mainstream in inference or training falls into one of those four buckets,” Rowen said. “There are interesting hybrids and intermediate points, but those four cover what most people are considering. The difference in power between an 8-bit integer or fixed point operation and a 32-bit floating point operation might be as much as a factor of 10, so you are going to pay very differently. People should choose the lowest resolution that they are confident will solve their problem with sufficient accuracy. For this reason, many platforms support multiple data types.”

A second-order consideration is that if every possible data type is supported, the power will tend to be higher. Computation tends to look like the highest-power computation that is supported, so it’s not always wise to support every data type. The sheer presence of all the other options tends to bloat the design, which carries a penalty in cost and power even when they are not being used. That means design teams need to determine which data types to support and the range of that data. Extra degrees of flexibility are a type of guard banding.

In addition, no real-world network runs at 100% efficiency because all platforms have some degree of flexibility, said Rowen. They are all programmable in one way or another. Some layers in a neural network do no multiplies at all, so the fact that there are multipliers present means they’re going to be idle. The degree to which the hardware is balanced to fit the profile of what’s actually used across a range of different neural network structures does become important, and it often depends quite a bit on what the usage model is. 
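The idle-multiplier effect Rowen describes can be expressed as a simple utilization ratio. The sketch below uses invented per-layer numbers to show how a zero-multiply layer (pooling, for instance) drags achieved throughput below the hardware's peak.

```python
def utilization(layer_macs, layer_cycles, peak_macs_per_cycle):
    """Fraction of peak multiplier throughput actually achieved:
    useful MACs performed, divided by what the multiplier array
    could have done in the same number of cycles. Layers that do
    no multiplies still consume cycles, so they pull this down."""
    total_macs = sum(layer_macs)
    total_cycles = sum(layer_cycles)
    return total_macs / (total_cycles * peak_macs_per_cycle)

# Hypothetical three-layer profile: conv, pooling (zero MACs), conv.
# All numbers are invented for illustration.
example = utilization(
    layer_macs=[1000, 0, 800],
    layer_cycles=[10, 5, 10],
    peak_macs_per_cycle=128,
)
```

Here the middle layer performs no multiplies at all, so even if both conv layers ran the array flat-out, overall utilization could never reach 100%—which is why the balance of hardware against the real layer mix matters.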

Different platforms will have different energy efficiency levels even with the same utilization. One reason why that occurs is because more or less flexibility can be built into the architecture.

“Because neural networks are so dependent on multiply, people tend to fixate a little bit on how many multipliers you have, how many multipliers per core, how many multiplies per cycle, how many multiplies per second,” Rowen explained. “It’s not a bad metric, but if someone has a multiplier, is it in a very simplistic structure that allows it to be fully utilized only under certain very simple conditions? Or do you have a lot of control complexity that allows you to use that multiplier in lots of different circumstances? What you find is a tradeoff in the architectures. You tend to have architectures that have very high utilization of the multipliers, but they also tend to be less efficient because there’s other logic gate complexity around that multiplier. Think of it as more plumbing to get the right data to the multiplier at the right time under all possible circumstances.”

The fundamentals
At their most basic, CNNs are a combination of multiply-accumulate operations and memory handling.

“You have to be able to get the data in and out quickly so that you don’t starve your multiply accumulators, and of course your multiply accumulators have to be efficient and small,” said Gordon Cooper, product marketing manager for embedded vision processors at Synopsys. “When it comes to the tradeoffs, first of all they’re going to want a level of accuracy for their CNN algorithm. Different engineers have different sorts of targets. If you were doing object detection or maybe scene segmentation, where you want to identify every pixel in the scene, there are different graphs. Each graph might have a different characteristic, but at the end of the day there’s a certain amount of multiplies to accomplish that graph. Then, the hardware has to be able to accomplish those multiplies. So the very first tradeoff is performance. You have to have a level of performance that gives you the right accuracy, because if you don’t have that then you’re not accomplishing what you want. If you can’t see that pedestrian at a certain distance away or handle the megapixels of the camera that has seen the pedestrian, then you can’t accomplish anything. So performance is really the key.” 
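A naive convolution makes the multiply-accumulate-plus-memory structure Cooper describes explicit. The sketch below is written for clarity, not speed; a real accelerator performs the same MACs in parallel, with aggressive data reuse so the accumulators are never starved.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 2D convolution (valid padding, no kernel flip): every
    output pixel is one multiply-accumulate over a kernel-sized
    window of the input. This is exactly the work a CNN accelerator
    parallelizes."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # kh * kw multiplies and adds per output pixel; note how
            # adjacent windows overlap, which is the data-reuse
            # opportunity the memory hierarchy must exploit.
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out
```

The overlapping windows are the crux: fetching each pixel from memory once and reusing it across neighboring outputs is what keeps the multiply-accumulate units fed.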

Power is a close second, though.

“If power is no issue, you get a GPU and have it liquid cooled,” said Cooper. “But if you have power issues — and power and area are tied hand-in-hand in a lot of cases even though there are some differences — you want small area and small power, so you’ll want to optimize a CNN as much as possible. This is where the tradeoffs start to happen. If you have a CNN, you can actually just hardwire it to make it hardware. It won’t be programmable, and it will be as small as possible, like an ASIC design. The problem with that is it’s a moving target. There’s always new research and new CNNs coming out, so they’re going to have to be programmable—but programmable as efficiently as possible, to stay as small as possible.”

Looking ahead, Cooper said CNNs are starting to be seen in areas outside of vision. “Traditionally the use of the CNN is where there is a still image and uncompressed frame, basically a two-dimensional image, and you are extracting information out of that where the computer is trying to infer what’s in that image. This has been done in vision applications with cameras, but we are seeing radar, which is not a traditional 2D image. We are seeing audio, along with other use cases. CNNs are starting to get so prevalent that they’re bleeding into other areas outside of just images, and that gets very interesting.”

One example of the ongoing evolution of deep learning architectures is the idea of recurrent neural networks (RNNs), which add temporal elements, he noted. “If you have a two-dimensional image and it’s frozen in time, what happens in the next image or the image after that? You have this idea of time or a temporal component, and that’s where the recurrent neural network adds that capability. CNNs aren’t going away, but we are seeing recurrent neural networks starting to gain some attractiveness for speech and audio and other areas. Or if you want to combine a CNN, where you are capturing what’s in the image, with an RNN to find out what happens over time, now you can actually do a video caption to describe what’s happening in the video. It’s not just a guy holding a guitar. It’s a guy playing a guitar. That’s the near future,” Cooper said.

CNNs are just one of many possible approaches being suggested for moving large quantities of data. As the amount of data continues to balloon, entirely new architectures are being suggested.

“Moore’s Law does not work anymore for modern scaling,” said Steven Woo, distinguished inventor and vice president of solutions marketing at Rambus. “The growth of digital data is happening far faster than device scaling can handle. So if you need to analyze or search through that data, that’s a different need than what existing architectures are built to do.”

CNNs help move the data more effectively between processing elements and memory, but there is also ongoing research into new types of memory and new processor architectures to go along with neural networks, as well as new architectures that move the processing closer to or even into memory.

“This is forcing paradigm changes,” said Woo. “Neural networks are good at approximate tasks. We’re also seeing self-organizing systems, where they don’t necessarily understand the features of the network. This is happening with machine learning, and there is some work in AI and neural networks where they learn features autonomously to handle large data sets.”

—Ed Sperling contributed to this report.

Related Stories
Speeding Up Neural Networks
Adding more dimensions creates more data, all of which needs to be processed using new architectural approaches.
Convolutional Neural Networks Power Ahead
Adoption of this machine learning approach grows for image recognition; other applications require power and performance improvements.
The Great Machine Learning Race
Chip industry repositions as technology begins to take shape; no clear winners yet.
Inside AI And Deep Learning
What’s happening in AI and can today’s hardware keep up?
Inside Neuromorphic Computing
General Vision’s chief executive talks about why there is such renewed interest in this technology and how it will be used in the future.
Neuromorphic Chip Biz Heats Up
Old concept gets new attention as device scaling becomes more difficult.
Five Questions: Jeff Bier
Embedded Vision Alliance’s founder and president of BDTI talks about the creation of the alliance and the emergence of neural networks.
