The Secret Life Of Accelerators

Unique machine learning algorithms, diminished benefits from scaling, and a need for more granularity are creating a boom for accelerators.


Accelerator chips increasingly are providing the performance boost that device scaling once provided, changing basic assumptions about how data moves within an electronic system and where it should be processed.

To the outside world, little appears to have changed. But beneath the glossy exterior, and almost always hidden from view, accelerator chips are becoming an integral part of most designs where performance is considered essential. And as the volume of data continues to rise—more sensors, higher-resolution images and video, and more inputs from connecting systems that in the past were standalone devices—that boost in performance is required. So even if systems don’t run noticeably faster on the outside, they need to process much more data without slowing down.

This renewed emphasis on performance has created an almost insatiable appetite for accelerators of all types, even in mobile devices such as a smart phone where one ASIC used to be the norm.

“Performance can move total cost of ownership the most,” said Joe Macri, corporate vice president and product CTO at AMD. “Performance is a function of frequency and instructions per cycle.”
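Macri's relationship can be written as a first-order model: instructions per second equal clock frequency multiplied by instructions per cycle (IPC). The sketch below is only an illustration of that arithmetic. The function name and the two example parts are hypothetical, and real performance also depends on memory stalls, the workload and much else.

```python
def instructions_per_second(frequency_hz: float, ipc: float) -> float:
    """First-order throughput model: clock frequency times instructions per cycle."""
    return frequency_hz * ipc


# Hypothetical parts, chosen only to illustrate the tradeoff.
fast_narrow = instructions_per_second(4.0e9, 1.0)  # 4 GHz clock, IPC of 1
slow_wide = instructions_per_second(2.0e9, 4.0)    # 2 GHz clock, IPC of 4

print(f"fast, narrow core: {fast_narrow:.1e} instructions/s")
print(f"slow, wide core:   {slow_wide:.1e} instructions/s")
```

Under this model, the part running at half the frequency still comes out ahead because it completes four times as much work per cycle. That is the kind of gain accelerators aim for on the operations they target.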

And this is where accelerators really shine. Included in this class of processors are custom-designed ASICs that offload a particular operation in software, as well as standard GPU chips, heterogeneous CPU cores that can work on jobs in parallel (even within the same chip), and both discrete and embedded FPGAs.

But accelerators also add challenges for design teams. They require more planning and a deeper understanding of how software and algorithms work within a device, and they are very specific. Reuse of accelerators can be difficult, even with programmable logic.

“Solving problems with accelerators requires more effort,” said Steve Mensor, vice president of marketing at Achronix. “You do get a return for that effort. You get way better performance. But those accelerators are becoming more and more specific.”

Fig. 1: Accelerator on a chip. Source: Stanford

Accelerators change the entire design philosophy, as well. After years of focusing on lower power, with more cores on a single chip kept mostly dark, the emphasis has shifted to a more granular approach to ratcheting up performance, usually while keeping the power budget flat or trending downward. So rather than having everything tied to a single CPU, there can be multiple heterogeneous types of processors or cores with more specialized functionality.

“There is now more granularity to balance a load across cores, so you can do power management for individual cores,” said Guilherme Marshall, director of marketing for development solutions at ARM. “These all require fine tuning of schedulers. This is a trend we’ve been seeing for a while, and it’s evolving. The first implementation of this was big.LITTLE. Now, there is a finer degree of control of the power for each core.”

This may sound evolutionary, but it’s not a trivial change. Marshall noted this required changes to the entire software stack.

Concurrent with these changes, there is an effort to make software more efficient and faster. For years, software has been developed almost independently from the hardware, conforming to a set of programming interfaces and focusing on the prioritization and scheduling of different operations. The result has been bloated software code that is functionally correct, but which is slow and uses too much energy per operation. That software also has become so complex that trying to make it secure is nearly impossible.

The push toward software-defined hardware is an outgrowth of this problem. But it also reflects an underlying issue. The flow of data, and where it gets processed, need to be rethought as the volume of that data grows.

“There is a ton of work going on in universities on acceleration,” said Mike Gianfagna, vice president of marketing at eSilicon. “At one level, this presents another challenge for IP reuse because it requires a different level of integration. It may involve the chip, interface chiplets, a high-bandwidth memory stack and different substrates. And the accelerators themselves need to be more robust.”

Machine learning mania
Accelerators are best known for their role in machine learning, which is seeing explosive growth and widespread applications across a number of industries. This is evident in sales of GPU accelerators for speeding up machine learning algorithms. Nvidia’s stock price chart looks like a hockey stick. GPUs are extremely good at accelerating algorithms in the learning phase of machine learning because they can run floating point calculations in parallel across thousands of inexpensive cores.

As a point of reference, Nvidia’s market cap is slightly higher than that of Qualcomm, one of the key players in the smart phone revolution. According to the latest statistics from Yahoo, Qualcomm’s market cap is $78.4 billion and its price/earnings ratio is 17.7. Nvidia’s market cap is $99.4 billion and its P/E ratio is 56.09. A year ago, Nvidia’s stock price was $57 per share. It is now trading at $167.

Fig. 2: Nvidia’s stock price over the past decade. Source: Yahoo

Nvidia doesn’t own the whole pie, though. The next phase of machine learning is inferencing—taking the learned behavior of devices and using it to make decisions within defined parameters. That is essentially a Gaussian distribution of possible actions based upon those learnings, and it plays to the strengths of FPGA and DSP accelerators, which are better at fixed-point calculations.
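To make the fixed-point point concrete, here is a minimal sketch of a dot product, the core operation in most inference workloads, done entirely with integer multiply-accumulate. The 8-bit fractional format, the function names and the sample values are assumptions for illustration. They are not taken from any particular FPGA or DSP toolchain.

```python
def to_fixed(x: float, frac_bits: int = 8) -> int:
    """Convert a float to a signed fixed-point integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def fixed_dot(weights, inputs, frac_bits: int = 8) -> float:
    """Dot product using integer multiply-accumulate, rescaled once at the end."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += to_fixed(w, frac_bits) * to_fixed(x, frac_bits)
    # Each product carries 2 * frac_bits fractional bits, so divide by 2**(2 * frac_bits).
    return acc / (1 << (2 * frac_bits))


# Hypothetical weights and inputs for a three-element neuron.
weights = [0.5, -0.25, 0.125]
inputs = [1.0, 2.0, 4.0]

float_result = sum(w * x for w, x in zip(weights, inputs))
fixed_result = fixed_dot(weights, inputs)
print(float_result, fixed_result)
```

These sample values are exactly representable with 8 fractional bits, so the two results match. In general, fixed-point arithmetic trades a small rounding error for much simpler hardware, since integer multipliers and adders are cheaper than floating-point units.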

The common denominator here is that both of these phases require accelerators, and accelerators are nothing new. They have been around for decades. Intel’s 8087 floating point co-processor, introduced in 1980 as a companion to the 8086 CPU, is a case in point. Machine learning has been around even longer, dating back to the late 1950s.

It’s the marriage of these two worlds that is creating so much buzz. Together they seem to be at the epicenter of a number of new and rapidly growing markets, including automotive electronics, smart manufacturing and, increasingly, semiconductor design.

“There are three aspects to machine learning—sense, decide and act,” said Kurt Shuler, vice president of marketing at ArterisIP. “The innovation is in the ‘decide’ part. Right now there is no best architecture, but there is a lot of innovation on the algorithm and what people choose to accelerate. This is a form of constraints management. You have to use a hardware accelerator because you have a limited power budget and you need near real-time performance. And with machine learning, you can’t come up with one algorithm that does everything. So you create new algorithms, and they don’t run well on standard CPU cores or GPUs or SIMD. And then you have to create an accelerator.”

Many architectures, one purpose
While accelerators accomplish the same thing, no one size fits all and most are at least semi-customized.

“On one side, there are accelerators that are truly integrated into the instruction set, which is the best form of acceleration,” said Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “In the last couple years, we’ve also seen accelerators emerge as separate blocks in the same package. So you may have an FPGA and an SoC packaged together. There also can be semi-custom IP and an FPGA. IP accelerators are a relatively new concept. They hang off the interconnect. But to be effective, all of this has to be coherent. So it may range from not-too-complex to simple, but if you want it to be coherent, how do you do that?”

That’s not a simple problem to solve. Ironically, it’s the automotive industry, which historically has shunned electronics, that is leading the charge, said Mohandass.

New types of accelerators are entering the market, as well. One of the reasons that embedded FPGAs have been gaining so much attention is that they can be sized according to whatever they are trying to accelerate. The challenge is understanding up front exactly what size core will be required, but the tradeoff is that it adds programmability into the device. So it can be programmed to keep up with protocol updates and different software, but rightsizing these devices takes some educated guesswork.

“Accelerators are becoming more and more specific,” said Kent Orthner, systems architect at Achronix. “Traditionally, these were far from processors and memory. We’re seeing them moving closer to the CPU and main memory. In the network, smart NICs (network interface cards) are being included in the algorithm so they can do deep packet inspection before they give that data to the main processor.”

The data center is another growth area for eFPGA accelerators, in large part because of the disconnect between chipmakers and how those chips will ultimately be used. This is why Intel’s purchase of Altera makes sense. Adding programmable elements into packages or even nearby on a board allows these devices to handle data types and use cases that a standard CPU cannot.

“The basic rule is that the closer to the end application you are, the better picture you have about where to go next,” said Geoff Tate, CEO of Flex Logix. “But the ASIC supplier really has no idea what’s going on in the data center. The chip companies are in the middle of all of this. The data centers are now asking for programmability because the protocols are changing. They want to be able to build a data center and not replace their chips. So if a protocol changes, you don’t have to replace a switch or network interface chips. You just change the protocol. This is critical because as the data centers become bigger, they can have their own protocols.”

Accelerators in context
The semiconductor industry has always been focused on solving bottlenecks, which is why there are new materials such as cobalt and ruthenium being considered for interconnects, and why finFETs have replaced planar transistors. But unlike the past, where copper replaced aluminum wires, there are many things that need to be fixed at once, and it’s all getting more expensive.

Moreover, just adding more transistors into a design doesn’t necessarily buy more performance, because routing congestion and contention for memory can reduce the benefit of increased transistor density. Similarly, scaling memory chips doesn’t necessarily produce faster memory.

“In the mid-1990s, we saw a slowdown in DRAM scaling,” said Craig Hampel, chief scientist at Rambus. “Historically, we were seeing 35% CAGR, but by 2010 that had dropped to 25%. That created a gap in cost per bit. That created the need for a new memory solution. But every new solution has three essentials. First, it has to satisfy functional needs of memory, which is basically block size and cost. Second, it needs a ubiquitous interface. With some solutions, the latency and block size were too high. And third, you need software awareness to take advantage of memory.”
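A quick compounding calculation shows why a drop from 35% to 25% annual growth opens a sizable gap. This is a sketch under stated assumptions: the quote does not say exactly which metric the CAGR figures describe, and the 10-year horizon and function name are chosen here only for illustration.

```python
def compound(rate: float, years: int) -> float:
    """Cumulative growth factor after compounding an annual rate for a number of years."""
    return (1.0 + rate) ** years


years = 10  # illustrative horizon, not taken from the article

at_35 = compound(0.35, years)  # growth factor at the historical 35% rate
at_25 = compound(0.25, years)  # growth factor at the slower 25% rate
gap = at_35 / at_25            # how far the slower curve falls behind

print(f"35% CAGR over {years} years: {at_35:.1f}x")
print(f"25% CAGR over {years} years: {at_25:.1f}x")
print(f"shortfall factor: {gap:.2f}x")
```

Over a decade, the slower rate delivers less than half the cumulative improvement of the historical rate, which is consistent with Hampel's point about a widening gap in cost per bit.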

What is required is a system-level rethinking of a heterogeneous architecture, which includes both memory and processing elements. In the data center, this involves a couple of different approaches. “A big change is software-defined storage,” said Nick Ilyadis, vice president of portfolio and technology strategy at Marvell. “SSDs are replacing HDDs because they offer better speed and lower latency. So what we’re seeing is hyperconverged hyperscale.”

Hyperconvergence involves stacking compute and storage in a vertical configuration. Hyperscale, meanwhile, relies on adding more units to the network, which also adds more bandwidth. So with hyperconvergence, you scale up. With hyperscale, you scale out.

Accelerators are one more piece of the solution. “They’re not free,” noted Flex Logix’s Tate. “They take up space and they do have to be designed in. But you can accelerate different paths for different customers, and that has huge potential.”

Other applications
Driving the chip industry’s focus on accelerator chips are some fundamental market and technology shifts. There are more uncertainties about how new markets will unfold, what protocols need to be supported, and in the case of machine learning, what algorithms need to be supported. But accelerators are expected to be a key part of solutions to all of those shifts.

“The raw technology will be built into your sunglasses,” said ArterisIP’s Shuler. “Cars are the first place it will show up because consumers are willing to pay for it. But it will show up everywhere—in your phone, and maybe even your dishwasher.”

Already there is significant interest across a variety of markets, including base stations, inside of microcontrollers, in defense and aerospace, and in security for encryption/decryption. GPUs are even being used to accelerate patterning of photomasks for semiconductor manufacturing.

“With massively parallel GPUs you can manipulate any image,” said Aki Fujimura, CEO of D2S. “With CPUs, you get 22 to 24 cores for $7,000. With GPUs, you get thousands of cores for that price and you can do many different things. You need to select a small number of things you can do, but you can do those things in parallel. We still need CPUs to handle the majority of things, but there are a few good things that GPUs are good for, and they are really good at those things. If you’re trying to simulate, you need a SIMD architecture. But nature fits a GPU, and with a GPU you can simulate e-beam behavior as it hits the shape of a mask. At the same time, GPUs are horrible at dealing with an OASIS file.”

Fig. 3: GPU vs. CPU. Source: Nvidia

Rising design costs and diminishing returns on scaling, coupled with a slew of new and emerging markets, are forcing chipmakers and systems companies to look at problems differently. Rather than working around existing hardware, companies are beginning to parse problems according to the flow and type of data. In this world, a general-purpose chip still has a place, but it won’t offer enough performance improvement or efficiency to make a significant difference from one generation to the next.

Accelerators that are custom built for specific use cases are a much more effective solution, and they add a dimension to semiconductor design that is both challenging and intriguing. How much faster can devices run if everything isn’t based on a linear increase in the number of transistors? That question will take years to answer definitively.
