Machine Learning Shifts More Work to FPGAs, SoCs

SoC bandwidth, integration expand as data centers use more FPGAs for machine learning.


A wave of machine-learning-optimized chips is expected to begin shipping in the next few months, but it will take time before data centers decide whether these new accelerators are worth adopting and whether they actually live up to claims of big gains in performance.

There are numerous reports that silicon custom-designed for machine learning will deliver 100X the performance of current options, but those claims have yet to be proven in demanding real-world commercial use, and data centers are among the most conservative adopters of new technology. Still, high-profile startups including Graphcore, Habana, ThinCI and Wave Computing say they already have early silicon out to customers for testing, though none has begun shipping or publicly demonstrated those chips.

There are two main markets for these new devices. The neural networks used in machine learning put data through two main phases, training and inferencing, and different chips are used in each. While the neural network itself usually resides in the data center for the training phase, it may have an edge component for the inferencing phase. The question now is which types of chips, in which configurations, will produce the fastest, most power-efficient deep learning.

It appears that FPGAs and SoCs are gaining traction. Data centers need the flexibility of programmable silicon, and the high I/O capability of FPGAs suits the high-data-volume, low-processing-power requirements of both training and inferencing, according to Jim McGregor, president of Tirias Research.

FPGA setups are used less often now for training than they were a few years ago, but they are used much more often for everything else, and that use is likely to continue growing throughout next year. Even if all the 50 or so startups working on various iterations of a neural-network-optimized processor delivered finished products today, those products would take 9 to 18 months to show up in the production flow of any decent-sized data center.

“Nobody with a datacenter is going to buy something off the shelf and put it on a production machine,” McGregor said. “You have to make sure it meets reliability and performance requirements, and then deploy it en masse.”


Fig. 1: Deep learning chipsets by type. Source: Tractica

Still, there is an opportunity for new architectures and microarchitectures. ML workloads are expanding rapidly. The amount of compute capacity used for the largest AI/ML training runs has been doubling every 3.5 months, increasing the total amount of compute power used by 300,000X since 2012, according to a May report from OpenAI. By comparison, Moore's Law predicted a doubling of available resources every 18 months, which would produce a total capacity boost of only about 12X over the same period.
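The gap between those two growth rates compounds quickly. As a sanity check (our arithmetic, not OpenAI's), a single doubling-period formula reproduces both figures over the roughly 63 months between late 2012 and the May 2018 report:

```python
def compute_growth(months: float, doubling_period: float) -> float:
    """Total growth factor after `months`, doubling every `doubling_period` months."""
    return 2.0 ** (months / doubling_period)

# ~63 months separate late 2012 from the May 2018 OpenAI report.
ai_growth = compute_growth(63, 3.5)    # 2**18 = 262,144 -- on the order of the 300,000X cited
moore_growth = compute_growth(63, 18)  # 2**3.5, roughly 11.3, close to the 12X cited above
```

The exact factors depend on the window chosen, but the point stands: an 18-month doubling period yields tens of times more compute, while a 3.5-month period yields hundreds of thousands of times more.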

OpenAI noted that systems used for the largest training runs (some of which took days or weeks to complete) cost in the low single-digit millions of dollars to buy, but it predicts that most of the money spent on machine-learning hardware will go to inferencing.


Fig. 2: Compute needs are increasing. Source: OpenAI

This is a huge, brand-new opportunity. A May 30 Tractica report predicted the market for deep-learning chipsets will rise from $1.6 billion in 2017 to $66.3 billion by 2025, a figure that includes CPUs, GPUs, FPGAs, ASICs, SoC accelerators, and other chipsets. A good chunk of that will come from non-chip companies releasing their own deep-learning accelerator chipsets. Google did this with its TPU, and industry insiders say that Amazon and Facebook are taking the same path.

There is a heavy shift toward SoCs rather than standalone components, and increasing diversity in the strategies and packaging of SoC, ASIC and FPGA vendors, McGregor said.

Xilinx, Altera (now Intel) and others are trying to scale FPGAs by adding processors and other components to FPGA arrays. Others, such as Flex Logix, Achronix and Menta, embed FPGA resources in smaller pieces close to specific functional areas of the SoC and rely on high-bandwidth interconnects to keep data moving and performance high.

“You can use FPGAs anywhere you want programmable I/O, and people do use them for inference and sometimes training, but you see them more handling big-data tasks rather than training, which has a heavy matrix-multiplication requirement that is better suited to a GPU,” McGregor said.

The GPU is hardly an endangered species, however. Nvidia expects to still be around after the new ML chips ship, and it is taking steps to stay dominant in training and expand into inference as well, according to Moor Insights & Strategy analyst Karl Freund in a blog post.

Nvidia staked its claim with an announcement earlier this month of the NVIDIA TensorRT Hyperscale Inference Platform, which includes the Tesla T4 GPU. The T4 delivers 65 TFLOPS for training and 260 trillion 4-bit integer operations per second (TOPS) for inference – enough, Nvidia says, to handle 60 simultaneous video streams at 30 frames per second. It includes 320 Turing Tensor Cores optimized for the integer calculations needed for inferencing.
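A quick back-of-envelope translation of those numbers (our arithmetic, not Nvidia's, and it assumes all 260 TOPS are available to the video workload) shows the per-frame compute budget those figures imply:

```python
# Hypothetical budget math based on Nvidia's published T4 figures.
int4_tops = 260e12        # peak 4-bit integer ops per second
streams, fps = 60, 30
frames_per_sec = streams * fps             # 1,800 frames arrive every second
ops_per_frame = int4_tops / frames_per_sec # ~1.44e11 INT4 ops available per frame
```

That budget of roughly 144 billion low-precision operations per frame is ample for typical image-classification networks, which helps explain why the claim is framed around video streams.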

New architectures
Graphcore, one of the best-known startups, is working on a 23.6-billion-transistor "intelligence processing unit" (IPU) with 300 Mbytes of on-chip memory, 1,216 cores capable of 11 GFLOPS each, and internal memory bandwidth of 30 TB/s. Two IPUs fit on a single PCIe card, and each is designed to hold an entire neural network model on-chip.

Graphcore's upcoming entry is based on a graph architecture that depends on its software to convert data to vertices in which the numerical inputs, the function to be applied to them (add, subtract, multiply, divide) and the result are defined separately and can be processed in parallel. Several other ML startups use similar approaches.
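Graphcore has not published its internal representation, but the general idea of such a graph architecture can be sketched in a few lines: each vertex separately records its inputs, its operation, and its result, and every vertex whose inputs are resolved can fire independently of the others. The names and structure below are purely illustrative, not Graphcore's actual design.

```python
from dataclasses import dataclass
import operator

# Vertex operations, matching the add/subtract/multiply/divide example above.
OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}

@dataclass
class Vertex:
    op: str
    inputs: list        # literal numbers, or names of upstream vertices
    result: float = None

def evaluate(graph):
    """Fire vertices in waves; all vertices within one wave are
    independent and could run in parallel on separate cores."""
    def ready(v):
        return all(not isinstance(i, str) or graph[i].result is not None
                   for i in v.inputs)
    pending = set(graph)
    while pending:
        wave = [name for name in pending if ready(graph[name])]
        if not wave:
            raise ValueError("cycle or missing vertex in graph")
        for name in wave:
            v = graph[name]
            args = [graph[i].result if isinstance(i, str) else i
                    for i in v.inputs]
            v.result = OPS[v.op](*args)
        pending -= set(wave)
    return {name: graph[name].result for name in graph}
```

For example, `{"a": Vertex("add", [2, 3]), "b": Vertex("mul", [4, 5]), "c": Vertex("sub", ["a", "b"])}` evaluates `a` and `b` in the same (parallelizable) wave, then `c` as 5 - 20 = -15.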

Wave Computing didn't say when it would begin shipping, but it did reveal more about its architecture at the AI HW conference last week. The company plans to sell systems rather than chips or boards, using a 16nm processor with 15 Gbyte/second ports with HMC memory and interconnect, a choice designed to push graphs through processor clusters quickly without the bottleneck of sending data to a host processor over a PCIe bus. The company is exploring a move to HBM memory to get even faster throughput.


Fig. 3: Wave Computing’s first-generation Dataflow Processing Units. Source: Wave Computing

One of the best indicators of the heterogeneous future of machine learning and the silicon that supports it comes from Microsoft – a huge buyer of FPGAs, GPUs and pretty much everything else for deep learning.

"While throughput-oriented architectures such as GPGPUs and batch-oriented NPUs are popular for offline training and serving, they are not efficient for online, low-latency serving of DNN models," according to a May 2018 paper describing Project Brainwave, Microsoft's latest iteration of high-efficiency FPGA-based deep neural network (DNN) serving.

Microsoft pioneered the widespread use of FPGAs as DNN inference accelerators in large-scale data centers. The company is using them not as simple co-processors, but as "a more flexible, first-class compute kind of engine," said Steven Woo, distinguished inventor and vice president of enterprise solutions technology at Rambus.

Project Brainwave can deliver 39.5 TFLOPS of effective performance at batch size 1 using pools of Intel Stratix 10 FPGAs that can be called by any CPU software on a shared network, according to Microsoft. The framework-agnostic system converts trained deep neural network models into microservices that provide "real-time" inference for Bing search and other Azure services.


Fig. 4: Microsoft's Project Brainwave converts DNN models into deployable hardware microservices, exporting models from any DNN framework into a common graph representation and assigning sub-graphs to CPUs or FPGAs. Source: Microsoft

Brainwave is part of what Deloitte Global calls a "dramatic shift" that will emphasize FPGAs and ASICs to the point that they will make up 25% of the market for machine-learning accelerators during 2018. In 2016, that market consisted of CPUs and GPUs, with fewer than 200,000 units shipped. CPUs and GPUs will continue to dominate in 2018, with sales of more than 500,000 units, but the total market will also include 200,000 FPGAs and 100,000 ASICs as the number of ML projects doubles between 2017 and 2018, and doubles again between 2018 and 2020, Deloitte predicts.

FPGAs and ASICs use far less power than GPUs, CPUs or even the 75 watts Google's TPU burns under heavy load, according to Deloitte. They can also deliver a performance boost in specific functions chosen by customers, which can be changed by reprogramming the device.

"If people had their druthers they'd build things in ASICs at the hardware layer, but FPGAs have a better power/performance profile than GPUs and they're really good at fixed-point or variable-precision architectures," according to Steve Mensor, vice president of marketing for Achronix.

Their attraction is in the things they don’t bring to the data center, however—excessive power draw, heat, cost, latency.

“There are many, many memory subsystems, and you have to think about low power and IoT applications and meshes and rings,” said Charlie Janac, chairman and CEO of ArterisIP. “So you can either put all of those into a single chip, which is what you need for decision-making IoT chips, or you can add HBM subsystems with high throughput. But the workloads are very specific and you have multiple workloads for each chip. So the data input is huge, particularly if you’re dealing with something like radar and LiDAR, and those cannot exist without an advanced interconnect.”

What kind of processors or accelerators are connected to that interconnect can vary greatly because of that need for application specificity.

“There is a desperate need for efficiency at scale at the core,” according to Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “We can put in racks of ASICs and FPGAs and SoCs, and the more budget you have the more we can put in the racks. But ultimately you have to be efficient; you have to be able to do configurable or programmable multitasking. If you can bring multicasting to the vector processing workloads that make up most of the training phase, what you’re able to do expands dramatically.”

FPGAs are not particularly easy to program, and they aren't as easy to plug into a design as LEGO blocks, although they are evolving quickly in that direction, incorporating compute cores, DSP cores and other blocks of IP more commonly found in SoCs than in FPGA fabrics.

But moving from an SoC-like embedded FPGA chip to a full-blown system on a chip with a data backplane optimized for machine-learning applications isn’t as easy as it sounds.

“The performance environment is so extreme and the requirements are so different that SoCs in the AI space are a completely different beast than traditional architectures,” Mohandass said. “There is much more peer-to-peer communication. You are doing these vector-processing workloads with thousands of matrix rows, and you have all these cores available—but we have to be able to scale across hundreds of thousands of cores, not just a few thousand.”

Performance is critical. So are ease of design, integration, reliability and interoperability – characteristics SoC vendors address by focusing on underlying frameworks and design/development environments rather than just chipsets adapted to the specific requirements of machine-learning projects.

NetSpeed introduced an updated version of its SoC integration platform designed specifically for deep learning and other AI applications, a service to make integrating NetSpeed IP easier, and a design platform that uses a machine-learning engine to recommend blocks of IP to complete a design. The goal is to provide bandwidth across the whole chip rather than the centralized processing and memory typical of traditional design, according to the company.

“There is everything on the way from ASICs to neuromorphic chips to quantum computing, but even if we wouldn’t have to change the whole basis of our current architecture [to accommodate a new processor], high volume production of those chips is quite far off,” Mohandass said. “But we are both addressing the same problem. While they are working on that from the top down, we are also working from the bottom up.”

CPUs are still the most frequently used data processing elements in datacenters, followed by FPGAs and then GPUs, according to Geoff Tate, CEO of Flex Logix, who expects the mix of accelerators to change. But he noted it's unlikely that demand will drop off anytime soon, as datacenters try to keep up with demand for their own ML applications.

“Now people are spending a whole lot of money to come up with something that does the same thing better than GPUs and FPGAs,” Tate said. “And the general trend seems to be toward more specialized hardware for neural networking, so that’s where we’re probably headed. Microsoft, for example, says they use everything — CPUs, GPUs, TPUs and FPGAs — according to which can give them the best bang for the buck for a particular workload.”



