Speeding Up Neural Networks

Adding more dimensions creates more data, all of which needs to be processed using new architectural approaches.


Neural networking is gaining traction as the best way of collecting and moving critical data from the physical world and processing it in the digital world. Now the question is how to speed up this whole process.

But it isn’t a straightforward engineering challenge. Neural networking itself is in a state of almost constant flux and development, which makes it something of a moving target. There are more than 20 different types of neural networks today. Some are more in favor one month than the next.

In addition, there isn’t a clear answer for what is the best type of processor to use to drive this technology. The commonly accepted metrics—work done per unit of energy, per millisecond, and for the lowest possible cost—still apply, of course. But they can be weighted differently at different times, both in the development cycle and the real-world applications.

On top of that, some approaches are considered better than others for certain tasks. For example, convolutional neural networks (CNNs), which have emerged as the centerpiece of embedded vision in cars and drones, may be replaced or supplemented by recurrent neural networks (RNNs). An RNN can help distinguish not just whether an object is a dog or a person, but it can determine what it is doing over time. A dog may be moving into the road, or it may be moving away from the road. Or a child may be chasing after a ball that is heading into a busy road. But a CNN will only give snapshots of the movement, while an RNN will provide enough data over a period of time to determine whether a car needs to brake and how quickly.

Each of these approaches has tradeoffs. An RNN provides much better context for understanding moving image data, but it means processing an extra dimension of data. How much of that data gets processed locally versus centrally can affect how quickly a vehicle can react, how much power it consumes, how long the sensor network will be reliable, and the overall architecture of a system. And just to add more confusion, algorithms are still being developed to utilize this data more efficiently, which can affect any or all of these factors.

“Right now, there is no clear-cut algorithm for vision,” said Randy Allen, director of advanced research at Mentor, a Siemens Business. “It has to be highly parallel, but no one architecture is the clear winner. With that much parallelism, you need the right programming environment, too. With hardware, that means you also have to parallelize the bottleneck on memory and cache.”

In fact, there is widespread agreement that much more work needs to be done before the tech industry begins picking architectural and algorithmic winners.

“We’re still early in the process for neural networks,” said Mike Gianfagna, vice president of marketing at eSilicon. “The question being debated now is what algorithm fits best for a particular market. That’s probably not going to be decided by people picking algorithms or dedicated chips and approaches, though. It most likely will be limited to whatever dedicated silicon is available, and from there you will look at speed and power.”

Processor choices
One of the starting points for neural networks involves the processing elements. ASICs are the fastest, most efficient chips for any device. In volume, they’re also the cheapest. But they’re also very specific, meaning changes to designs are expensive and difficult—sometimes even not possible. With the neural network market still in flux, that option doesn’t make sense at this point. When the market matures, ASICs are a strong contender.

The GPU was the initial chip of choice for developing neural networks and machine learning. It’s cheap, easy to parallelize, and the software development tools are relatively mature. The big problem is that it’s not very energy-efficient or reprogrammable, and results are not always deterministic.

FPGAs, in contrast, are bigger and therefore more expensive, but they’re easily programmable, which helps keep up with changes as technology evolves. They also use significantly less power. “With reconfigurable interfaces and programmable logic you can address different layers of a neural network,” said Steve Glaser, senior vice president of corporate strategy and marketing at Xilinx. “You also have software with defined programming that you can use for industry standard libraries like TensorFlow and Caffe.”

A relatively new wrinkle in this world is the embedded FPGA, which can be any size. Perhaps even better, it can combined with other types of processors. “The economics of cost on a die typically translates into area,” said Kent Orthner, system architect at Achronix. “So an FPGA area is larger than an ASIC, but if you need the programmability then you need an FPGA. What an eFPGA allows you to do is to add that into a die with more control over the size.”

DSPs add another programmable option. Unlike other processor types, DSPs use fixed-point processing rather than floating point for the multiply-accumulate computations in a neural network.

“The only advantage of floating point is that you do not have to convert to fixed point from MatLab, but it’s less efficient than fixed point,” said Gordon Cooper, product marketing manager for the Embedded Vision Processor group at Synopsys. “Fixed point uses the least amount of power and offers the highest performance. And this isn’t just for automotive. It’s also for surveillance, drones, moving image cameras and multi-function printers.”

Much of this centers around embedded vision, which generates the most data of any of the senses. Effectively processing that data involves a momentum coefficient, which is essential in “training” the system.

“If you look at the AlexNet MAC (media access control) layer, there is 240 megabytes of coefficient, which is a lot of data,” said Pulin Desai, marketing director for Cadence’s Tensilica Vision DSP. “In fixed point, you can divide that by four in memory accesses. That’s faster, and it also saves in power. It can be optimized for radar, LiDAR, or any sensor. But the data from the vision sensor is the most data. Everything else is sparser than a vision sensor.”

Fig. 1: CNN algorithm development trends. Source: Cadence

Optimizing data flow
Regardless of the type of processor used, all of these approaches still rely on a von Neumann-based computing architecture. The difference is that in a neural network, the computing is more distributed. Some of that is still managed centrally, although exact amounts vary from one application to the next. The bottom line, though is the individual computational elements are well understood.

Fig. 2: Von Neumann architecture.

That makes a good starting point for ongoing research to speed up neural networking, even though it’s not as simple as von Neumann’s architecture might suggest. There can be more levels of memory (cache), which functions something like pre-fetch does in search. As more cores are added to speed up the processing of data, that cache needs to be coherent, as well. But not all of the cache needs to be physical. Some of it can be virtualized as a proxy cache, which can act like another level of cache as needed. That becomes particularly useful in recurrent neural networking, because there is an extra dimension of data that needs to be included in the matrix.

Fig. 3: Deep neural network. Source: OneSpin Solutions.

“Basically, what you’re doing is defining state input and output, and you can keep that state in the proxy cache,” said Kurt Shuler, vice president of marketing at ArterisIP. “This works really well with a recurrent neural network, because you can keep the state of today but put through information from the past. This is not just about processing, though. It’s also how you move current and past data from one processing step to the next.”

The advantage of this kind of approach is that it reduces the latency on coherent memory caching because not everything needs to be stored in the cache or memory, and all of it is accessible to all processing elements on the neural network. “There is spatial locality and temporal locality, and when you cache you want to be able to take advantage of both,” said Shuler. “Associating processing elements with proxy caches determines the data flow, which provides lots of flexibility and architectural options for RNNs.”

A second approach to speeding up neural networks is to train these systems to do more in the same physical space. Much of the current work is based on one of two key software libraries, Caffe and TensorFlow. TensorFlow was developed by Google, Caffe by UC Berkeley’s AI Research Lab. Both are closely associated with AI, which relies on neural networking for bridging computing with the outside world. And this is where fixed point really shows some promise.

“You can take a 32-bit floating point calculation and come close with a 10-bit solution on fixed point and not lose a significant amount of accuracy,” said Synopsys’ Cooper. “The goal is how small you can make this—how many multiply accumulates you can shove into a small space and how you make that more efficient. ImageNet, which is in competition with AlexNet, claims that it got to a 3% error rate, which is better than a human can do. But the general trend is to reduce the number of computations through compression. You want to reduce the number of coefficients and the number of multiplies, or basically the number of computations, through compression or pruning.”

Improving throughput also can be a function of how much processing is done locally versus centrally, a debate that has been underway for some time in the IoT regarding what gets processed by the edge device, by the fog server, or in the cloud. But in an autonomous vehicle, all of this has to happen at much faster speeds. If image sensors in a car have to relay a streaming data feed to a central logic unit, the amount of data flowing through the car’s internal network will be enormous.

“With an imaging DSP, you do the image processing, which includes noise reduction or image correction, and then you send that to a neural networking machine,” said Cadence’s Desai. “You can have a neural networking accelerator next to the DSP and you can accelerate the convolution layer but not the rest of the layer. But both the DSP and the hardware accelerate are working all the time, so moving data between the accelerator and the DSP can use a lot of power.”

For engineers working in the networking space, this is a familiar problem. Networking ASICs had similar challenge when they were first introduced.

“The network processing units had as many as 200 blocks, but they were using a pipelined architecture so each block had a link back,” said Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “Ultimately what they created was a backplane fabric underlying all of the engines. For neural networks to make progress, they’re going to require a similar type of backplane standard to enable all of the components to talk. Everything is not on all of the time, but when they do come on, there has to be a unified view. Typically, what happens is you accelerate starting from the edge and work down to the core.”

Putting neural nets in perspective
What’s different about neural networking is that these networks can be trained to be more efficient, a pattern that follows development of the human brain. An infant has more neurons than an adult—a process known as synaptic pruning—and a successfully designed neural network should be become more efficient or capable over time.

“Networks train image processing and language processing,” said , CEO of OneSpin Solutions. “Deep neural networks consist of several layers of networks. There is a race on for this technology, using multi-dimensional constructs.”

Brinkmann noted that the big problem is still the volume of data. “You want to go from a von Neumann to a data flow architecture. But what is the right architecture?”

So far that isn’t clear, and it probably won’t be for some time. No matter how far scientists and engineers have come with neural networking, and its application to machine learning and artificial intelligence, there are many years of work ahead.

“What people describe as AI today aren’t really all that intelligent,” said Simon Segars, CEO of ARM. “They’re more intelligent than running linear code on something, but there’s such a long way to go in development of these algorithms. We’re not at the point yet where you can unlock the creativity of hundreds of thousands of developers. In the early days of phones, all of the software never left the factory. Then we got to open platforms and application developers could think about use cases. Right now, AI is so specialized that it’s at the point where we’ve only seen a fraction of the applications because it’s too hard. Over time we’ll seen explosion of use cases. But it has a very, very long way to go.”

Related Stories
Convolutional Neural Networks Power Ahead
Adoption of this machine learning approach grows for image recognition; other applications require power and performance improvements.
The Great Machine Learning Race
Chip industry repositions as technology begins to take shape; no clear winners yet.
Inside AI And Deep Learning
What’s happening in AI and can today’s hardware keep up?
Inside Neuromorphic Computing
General Vision’s chief executive talks about why there is such renewed interest in this technology and how it will be used in the future.
Neuromorphic Chip Biz Heats Up
Old concept gets new attention as device scaling becomes more difficult.
Five Questions: Jeff Bier
Embedded Vision Alliance’s founder and president of BDTI talks about the creation of the alliance and the emergence of neural networks.


Kevin Shaw says:

Great article.

Vincent Ratford says:


Clearly this could be a blue ocean for semi’s or a bust like the wave of Network Processors. Hope to continue the discussion at next week’s Embedded Vision Summit in Santa Clara, CA ? Will you be there ?

Leave a Reply

(Note: This name will be displayed publicly)