Making Waves In Deep Learning

How deep learning applications will map onto a chip.

A little more than two and a half years ago I wrote Making Waves in Low-Power Design, an article about a company then called Wave Semiconductor. Fast forward to the recent Linley Processor Conference, where Wave Computing’s CTO Chris Nicol gave the audience an update on the company’s eagerly awaited 16K-core dataflow processor for deep learning, which is planned to tape out in October.

Figure 1. Wave Computing Dataflow Processor Chip

Figure 1 above shows a representative layout for the chip, which will be manufactured in a TSMC 16nm finFET process. The center portion is a full-custom design and consists of 16K processors with 8K DPU arithmetic units. It is an 8-bit machine that supports signed and unsigned operations, with a claimed 181 peak tera-ops. The chip has 4 Hybrid Memory Cube interfaces (each at 60GB/s) plus 2 DDR4 interfaces (each at 15GB/s), for a total peak memory bandwidth of 270GB/s.
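
The quoted total can be reproduced from the per-interface numbers. The following is only a quick arithmetic check; the constant and function names are illustrative and do not come from Wave.

```python
# Per-interface figures as quoted in the article (GB/s).
HMC_INTERFACES = 4
HMC_GBPS_EACH = 60
DDR4_INTERFACES = 2
DDR4_GBPS_EACH = 15


def peak_memory_bandwidth_gbps():
    """Sum the peak bandwidth of all external memory interfaces."""
    hmc_total = HMC_INTERFACES * HMC_GBPS_EACH      # 4 * 60 = 240
    ddr4_total = DDR4_INTERFACES * DDR4_GBPS_EACH   # 2 * 15 = 30
    return hmc_total + ddr4_total


print(peak_memory_bandwidth_gbps())  # 270
```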

To reach the claimed 181 peak tera-ops, the processing elements (PEs) run at 10GHz, each with a pipelined 256-entry 10GHz instruction RAM with ECC and a pipelined 1KB 5GHz single-port data RAM with BIST and ECC. This adds up to 16MB of distributed data memory and 8MB of distributed instruction memory across the chip. Nicol said that Wave designed a 28nm test chip back in 2012 that ran at 11GHz in the lab. No power consumption estimates were given for the design. The chip uses asynchronous design techniques and self-timed MPP synchronization, with no global clock or global signals, which helps enable robust, PVT-insensitive operation and low-voltage scalability. Data is used to wake and sleep the processors, so power is automatically reduced when there is no data available for processing.
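
The memory totals follow from the per-PE figures. The sketch below assumes binary units (K = 1024, M = 1024 × 1024) and memory spread evenly over the 16K processors. The derived instruction width is an inference from those totals that ignores any ECC bits, not a number Wave stated.

```python
KIB = 1024
MIB = 1024 * KIB

NUM_PES = 16 * KIB                # "16K processors"
DATA_RAM_BYTES_PER_PE = 1 * KIB   # "1KB ... data RAM"
INSTR_ENTRIES_PER_PE = 256        # "256-entry ... instruction RAM"
TOTAL_INSTR_BYTES = 8 * MIB       # "8MB of distributed instruction memory"


def total_data_memory_mib():
    """Distributed data memory across all PEs, in MiB."""
    return NUM_PES * DATA_RAM_BYTES_PER_PE // MIB


def implied_bits_per_instruction():
    """Instruction width implied by the totals (inference, excludes ECC)."""
    instr_bytes_per_pe = TOTAL_INSTR_BYTES // NUM_PES   # 512 bytes
    return instr_bytes_per_pe * 8 // INSTR_ENTRIES_PER_PE


print(total_data_memory_mib())         # 16
print(implied_bits_per_instruction())  # 16
```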

Figure 2. Cluster Organization

Wave organizes its PEs into clusters of 16, as shown in Figure 2 above. Each of the 16 square regions shown in the center of Figure 1 contains 64 clusters, for a total of 1K PEs per region. Each cluster also contains 8 DPU arithmetic units that can be grouped on a per-cycle basis into 8-, 16-, 24-, 32- or 64-bit operations. Each DPU contains pipelined MAC units with signed and unsigned saturation, support for floating-point emulation, a barrel shifter and a bit processor, as well as support for SIMD and MIMD instruction classes.
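
The hierarchy described here multiplies out to the chip-level counts quoted for Figure 1. This is a small arithmetic sketch with illustrative names, assuming every region and cluster is populated identically.

```python
REGIONS = 16
CLUSTERS_PER_REGION = 64
PES_PER_CLUSTER = 16
DPU_UNITS_PER_CLUSTER = 8


def chip_totals():
    """Return (PEs per region, total PEs, total DPU arithmetic units)."""
    pes_per_region = CLUSTERS_PER_REGION * PES_PER_CLUSTER   # 1K per region
    total_clusters = REGIONS * CLUSTERS_PER_REGION
    total_pes = total_clusters * PES_PER_CLUSTER             # 16K
    total_dpu_units = total_clusters * DPU_UNITS_PER_CLUSTER  # 8K
    return pes_per_region, total_pes, total_dpu_units


print(chip_totals())  # (1024, 16384, 8192)
```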

Figure 3. Wave Dataflow Mapping

Wave’s goal in inventing the Dataflow Processing Unit (DPU) architecture was to accelerate deep learning training by 10x. Figure 3 shows how Wave envisions deep learning applications mapping onto the chip. Wave Computing believes dataflow applications run better on a dataflow machine, and that is what the company set out to create with its new design. When asked whether the machine was designed more for training or for inferencing, Nicol replied that ultimately it’s all dataflow, so it can do training or inferencing or even run a hybrid mix. The configurable on-chip resources give it the flexibility to tackle many different applications. Nicol said that deep learning is exciting and moving quickly. The time from algorithm to deployment is now measured in months, and Wave wanted to be able to implement any new algorithm or number format (8-bit log today, with 16-bit log under consideration for training). Wave also sees its dataflow computer being applicable to searching, sorting, encryption and decryption, as well as deep learning inferencing and training.
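
For readers less familiar with the term, the general dataflow idea can be sketched in a few lines: an operation fires only when tokens are present on all of its inputs, and otherwise sits idle. The toy model below is a generic illustration under that assumption. The `Node`, `run` and `demo` names are hypothetical, and it does not represent Wave’s hardware, toolchain or programming model.

```python
from collections import deque


class Node:
    """One operation in a toy dataflow graph (hypothetical, not Wave's API)."""

    def __init__(self, name, func, num_inputs):
        self.name = name
        self.func = func
        # One FIFO of pending tokens per input port.
        self.inputs = [deque() for _ in range(num_inputs)]
        self.consumers = []  # (downstream node, input port) pairs
        self.fire_count = 0

    def connect(self, consumer, port):
        self.consumers.append((consumer, port))

    def ready(self):
        # A node may fire only when every input port holds a token.
        return all(len(q) > 0 for q in self.inputs)


def run(nodes):
    """Fire ready nodes until none can progress; return sink outputs."""
    results = []
    progressed = True
    while progressed:
        progressed = False
        for node in nodes:
            if not node.ready():
                continue  # no data available, so the node stays idle
            args = [q.popleft() for q in node.inputs]
            value = node.func(*args)
            node.fire_count += 1
            progressed = True
            if node.consumers:
                for consumer, port in node.consumers:
                    consumer.inputs[port].append(value)
            else:
                results.append(value)  # node with no consumers is a sink
    return results


def demo():
    # relu(x * w + b) for one neuron, expressed as three dataflow nodes.
    mul = Node("mul", lambda x, w: x * w, 2)
    add = Node("add", lambda p, b: p + b, 2)
    relu = Node("relu", lambda v: max(0, v), 1)
    mul.connect(add, 0)
    add.connect(relu, 0)

    # Inject two samples; the arrival of data is what triggers computation.
    for x, w, b in [(2, 4, 1), (-3, 5, 1)]:
        mul.inputs[0].append(x)
        mul.inputs[1].append(w)
        add.inputs[1].append(b)
    return run([mul, add, relu])


print(demo())  # [9, 0]
```

There is no central program counter in this sketch: the scheduling loop only checks which nodes have data, which is the property the article attributes to dataflow machines in general.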

Evaluation systems are planned for Q2 2017.
