The Great Machine Learning Race

Chip industry repositions as technology begins to take shape; no clear winners yet.


Processor makers, tools vendors, and packaging houses are racing to position themselves for a role in machine learning, despite the fact that no one is quite sure which architecture is best for this technology or what ultimately will be successful.

Rather than dampen investments, the uncertainty is fueling a frenzy. Money is pouring in from all sides. According to a new Moor Insights report, as of February 2017 there were more than 1,700 machine learning startups and 2,300 investors. The focus ranges from relatively simple dynamic network optimization to military drones using real-time information to avoid fire and adjust their targets.

Fig. 1: The machine learning landscape. Source: Moor Insights

While the general concepts involved in machine learning—doing things that a device was not explicitly programmed to do—date back to the late 1940s, the field has progressed in fits and starts since then. Stymied initially by crude software (1950s through 1970s), then by insufficient processing power, memory and bandwidth (1980s through 1990s), and finally by deep market downturns in electronics (2001 and 2008), machine learning has taken nearly 70 years to advance to the point where it is commercially useful.

Several things have changed since then:

  • The technology performance hurdles of the 1980s and 1990s are now gone. There is almost unlimited processing power, with more on the way using new chip architectures, as well as packaging approaches such as 2.5D and fan-out wafer-level packaging. Very fast memory is already available, with more types on the way, and advances in silicon photonics can speed up storage and retrieval of large blocks of data as needed.
  • There are ready markets for machine learning in the data center and in the autonomous vehicle market, where the central logic of these devices will need regular updates to improve safety and reliability. Companies involved in these markets have deep pockets or strong backing, and they are investing heavily in machine learning.
  • The pendulum is swinging back to hardware, or at the very least, a combination of hardware and software, because it’s faster, uses less power, and it’s more secure than putting everything into software. That bodes well for machine learning because of the enormous processing requirements, and it changes the economics for semiconductor investments.

Nevertheless, this is a technology approach riddled with uncertainty about what works best and why.

“If there was a winner, we would have seen that already,” said Randy Allen, director of advanced research at Mentor Graphics. “A lot of companies are using GPUs, because they’re easier to program. But with GPUs the big problem is determinism. If you send a signal to an FPGA, you get a response in a given amount of time. With a GPU, that’s not certain. A custom ASIC is even better if you know exactly what you’re going to do, but there is no slam-dunk algorithm that everyone is going to use.”

ASICs are the fastest, cheapest and lowest-power solution for crunching numbers. But they also are the most expensive to develop, and they are unforgiving if changes are required. Changes are almost guaranteed with machine learning because the field is still evolving, so relying on ASICs—or at least relying only on ASICs—is a gamble.

This is one of the reasons that GPUs have emerged as the primary option, at least in the short-term. They are inexpensive, highly parallel, and there are enough programming tools available to test and optimize these systems. The downside is they are less power efficient than a mix of processors, which can include CPUs, GPUs, DSPs and FPGAs.

FPGAs add future-proofing and lower power, and they can be used to accelerate other operations. But in highly parallel architectures they also are more expensive, which has renewed attention on embedded FPGAs.

“This is going to take 5 to 10 years to settle out,” said Robert Blake, president and CEO of Achronix. “Right now there is no agreed upon math for machine learning. This will be the Wild West for the next decade. Before you can get a better Siri or Alexa interface, you need algorithms that are optimized to do this. Workloads are very diverse and changing rapidly.”

Massive parallelism is a requirement. There is also some need for floating point calculations. But beyond that, it could be 1-bit or 8-bit math.

“A lot of this is pattern matching of text-based strings,” said Blake. “You don’t need floating point for that. You can implement the logic in an FPGA to make the comparisons.”
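
To make Blake’s point concrete, here is a minimal sketch—in Python, purely for illustration; on an FPGA the same comparisons would be built as parallel logic rather than a loop—of string pattern matching done entirely with integer and bitwise operations, with no floating point involved:

```python
# Minimal sketch: byte-oriented pattern matching using only integer compares.
# The pattern and data are illustrative; hardware would implement equivalent
# comparators in FPGA fabric.

def match_positions(haystack: bytes, needle: bytes) -> list[int]:
    """Return start offsets where `needle` occurs, using only integer/bitwise math."""
    hits = []
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        # XOR of equal bytes is zero; any() short-circuits on the first mismatch.
        if not any(h ^ p for h, p in zip(haystack[i:i + n], needle)):
            hits.append(i)
    return hits

print(match_positions(b"machine learning on machines", b"machine"))  # [0, 20]
```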

Learning vs. interpretation
One of the reasons this becomes so complex is that there are two main components to machine learning. One is the “learning” phase, which is a set of correlations or pattern matches. In machine vision, for example, it allows a device to determine whether an image is a dog or a person. That started out as 2D comparisons, but databases have grown in complexity. They now include everything from emotions to movement. They can discern different breeds of dogs, and whether a person is crawling or walking.

The harder mathematics problem is the interpretation phase. That can involve inferencing—drawing conclusions based on a set of data and then extrapolating from those conclusions to develop presumptions. It also can include estimation, which is how economics utilizes machine learning.
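
A minimal sketch of that split, using a toy nearest-centroid classifier in NumPy (the data and model here are hypothetical, chosen only to separate the two phases): the learning phase distills training data into parameters, and the inference phase applies those parameters to new inputs.

```python
import numpy as np

def train(features: np.ndarray, labels: np.ndarray) -> dict:
    """Learning phase: distill the training set into per-class centroids."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def infer(model: dict, sample: np.ndarray):
    """Inference phase: apply the learned parameters to a new sample."""
    return min(model, key=lambda c: np.linalg.norm(sample - model[c]))

# Toy data: two clusters standing in for "dog" vs. "person" feature vectors.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9]])
y = np.array(["dog", "dog", "person", "person"])

model = train(X, y)                          # compute-heavy step, typically done once
print(infer(model, np.array([0.95, 0.9])))   # -> "person"
```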

At this point, much of the inferencing is being done in the cloud because of the massive amount of compute power required. But at least some of that processing will need to happen on board in autonomous vehicles. For one thing, it’s faster to do locally. For another, connectivity isn’t always consistent, and in some locations it might not be available at all.

“You need real-time cores working in lock step with other cores, and you could have three or four levels of redundancy,” said Steve Glaser, senior vice president of corporate strategy and marketing at Xilinx. “You want an immediate response. You want it to be deterministic. And you want it to be flexible, which means that to create an optimized data flow you need software plus hardware plus I/O programmability for different layers of a neural network. This is any-to-any connectivity.”

How to best achieve that, however, isn’t entirely clear. The result is a scramble for market position unlike anything that has been seen in the chip industry since the introduction of the personal computer. Chipmakers are building solutions that include development software, libraries and frameworks, with built-in flexibility to guard against sudden obsolescence because the market is still in flux.

What makes this so compelling for chipmakers is that the machine learning opportunity is unfolding at a time when the market for smart phone chips is flattening. But unlike phones or PCs, machine learning cuts across multiple market segments, each with the potential for significant growth (see fig. 2 below).

Fig. 2: Machine learning opportunities.

Rethinking architectures
All of this needs to be put in context of two important architectural changes that are beginning to unfold in machine learning. The first is a shift away from trying to do everything in software to doing much more in hardware. Software is easier to program, but it’s far less efficient from a power/performance standpoint and much more vulnerable from a security standpoint. The solution here, according to Xilinx’s Glaser, is leveraging the best of both worlds by using software-defined programming. “We’re showing 6X better efficiency in images per second per watt,” he said.

A second change is the emphasis on more processors—and more types of processors—rather than fewer, highly integrated custom processors. This reverses a trend that has been underway since the start of the PC era, namely that putting everything on a single die improves performance per watt and reduces the overall bill of materials cost.

“There is much more interest in larger numbers of small processors than big ones,” said Bill Neifert, director of models technology at ARM. “We’re seeing that in the number of small processors being modeled. We’re also seeing more FPGAs and ASICs being modeled than in the past.”

Because a large portion of the growth in machine learning is tied to safety-critical systems in autonomous vehicles, better modeling and better verification of those systems are required.

“One of the benefits of creating a model as early as possible is that you can inject faults for all possible safety requirements, so that when something fails—which it will—it can fail gracefully,” said Neifert. “And if you change your architecture, you want to be able to route data differently so there are no bottlenecks. This is why we’re also seeing so much concurrency in high-performance computing.”
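
A hedged sketch of that kind of early fault injection, with an illustrative sensor model and failure mode (both are assumptions, not a description of ARM’s methodology): the test deliberately corrupts inputs and checks that every cycle still produces a defined, safe response.

```python
import random

def read_sensor(fail: bool = False):
    """Illustrative sensor model; returns None to mimic a dropped or corrupted sample."""
    return None if fail else random.uniform(0.0, 100.0)

def control_step(reading):
    """Fail gracefully: fall back to a safe default instead of crashing on bad data."""
    if reading is None:
        return "fallback: hold last known-safe state"
    return f"actuate: {reading:.1f}"

# Inject faults on a fraction of cycles and confirm every cycle still yields
# a defined, safe response.
for cycle in range(10):
    inject = (cycle % 3 == 0)   # deterministic fault pattern for the test
    print(cycle, control_step(read_sensor(fail=inject)))
```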

Measuring performance and cost with machine learning isn’t a simple formula, though. Performance can be achieved in a variety of ways, such as better throughput to memory, faster and more narrowly written algorithms for specific jobs, or highly parallel computing with acceleration. Likewise, cost can be measured in multiple ways, such as total system cost, power consumption, and sometimes the impact of slow results, such as a piece of military equipment or an autonomous vehicle not making decisions quickly enough.
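
One way to make those trade-offs explicit, sketched below with placeholder weights and numbers rather than real benchmarks, is a single figure of merit that rewards throughput per watt and penalizes missed latency budgets:

```python
def figure_of_merit(throughput_ips: float, power_w: float,
                    latency_ms: float, latency_budget_ms: float,
                    w_perf: float = 1.0, w_power: float = 0.5, w_late: float = 2.0) -> float:
    """Higher is better: reward images/s per watt, penalize power and missed deadlines."""
    efficiency = throughput_ips / power_w
    lateness = max(0.0, latency_ms - latency_budget_ms)
    return w_perf * efficiency - w_power * power_w - w_late * lateness

# Placeholder comparison of two hypothetical implementations.
print(figure_of_merit(throughput_ips=900, power_w=300, latency_ms=40, latency_budget_ms=50))
print(figure_of_merit(throughput_ips=600, power_w=75,  latency_ms=55, latency_budget_ms=50))
```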

Beyond that, there are challenges involving the programming environment, which is part algorithmic and part intuition. “What you’re doing is trying to figure out how humans think without language,” said Mentor’s Allen. “Machine learning is that to the nth degree. It’s how humans recognize patterns, and for that you need the right development environment. Sooner or later we will find the right level of abstraction for this. The first languages are interpreters. If you look at most languages today, they’re largely library calls. Ultimately we may need language to tie this together, either pipelining or overlapping computations. That will have a lot better chance of success than high-level functionality without a way of combining the results.”

Kurt Shuler, vice president of marketing at Arteris, agrees. He said the majority of systems developed so far are being used to jump-start research and algorithm development. The next phase will focus on more heterogeneous computing, which creates a challenge for cache coherency.

“There is a balance between computational efficiency and programming efficiency,” Shuler said. “You can make it simpler for the programmer. An early option has been to use an “open” machine learning system that consists of a mix of ARM clusters and some dedicated AI processing elements like SIMD engines or DSPs. There’s a software library, which people can license. The chip company owns the software algorithms, and you can buy the chips and board and get this up and running early. You can do this with Intel Xeon chips too, and build in your or another company’s IP using FPGAs. But these initial approaches do not slice the problem finely enough, so basically you’re working with a generic platform, and that’s not the most efficient. To increase machine learning efficiency, the industry is moving toward using multiple types of heterogeneous processing elements in these SoCs.”

In effect, this is a series of multiply and accumulate steps that need to be parsed at the beginning of an operation and recombined at the end. That has long been one of the biggest hurdles in parallel operations. The new wrinkle is that there is more data to process, and movement across skinny wires that are subject to RC delay can affect both performance and power.
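
In sketch form, that parse-and-recombine pattern is a partitioned multiply-accumulate followed by a reduction; the chunking below is illustrative and stands in for work distributed across separate processing elements:

```python
import numpy as np

def partitioned_mac(a: np.ndarray, b: np.ndarray, num_parts: int = 4) -> float:
    """Split a dot product into chunks (parsed up front), MAC each chunk
    independently (e.g. on separate processing elements), then recombine."""
    chunks_a = np.array_split(a, num_parts)
    chunks_b = np.array_split(b, num_parts)
    partial_sums = [float(np.dot(ca, cb)) for ca, cb in zip(chunks_a, chunks_b)]
    return sum(partial_sums)  # recombination step

a = np.arange(8, dtype=np.int32)
b = np.ones(8, dtype=np.int32)
assert partitioned_mac(a, b) == float(np.dot(a, b))  # 28.0
```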

“There is a multidimensional constraint to moving data,” said Raik Brinkmann, CEO of OneSpin Solutions. “In addition, power is dominated by data movement. So you need to localize processing, which is why there are DSP blocks in FPGAs today.”

This gets even more complicated with deep neural networks (DNNs) because there are multiple layers of networks, Brinkmann said.

And that creates other issues. “Uncertainty in verification becomes a huge issue,” said Achim Nohl, technical marketing manager for high-performance ASIC prototyping systems at Synopsys. “Nobody has an answer to signing off on these systems. It’s all about good enough, but what is good enough? So it becomes more and more of a requirement to do real-world testing where hardware and software is used. You have to expand from design verification to system validation in the real world.”

Internal applications
Not all machine learning is about autonomous vehicles or cloud-based artificial intelligence. Wherever there is too much complexity combined with too many choices, machine learning can play a role. There are numerous cases where this is already happening.

NetSpeed Systems, for example, is using machine learning to develop network-on-chip topologies for customers. eSilicon is using it to choose the best IP for specific parameters involving power, performance and cost. And ASML is using it to optimize computational lithography, basically filling in the dots on a distribution model to provide a more accurate picture than a higher level of abstraction can intrinsically provide.

“There is a lot of variety in terms of routing,” said Sailesh Kumar, CTO at NetSpeed Systems. “There are different channel sizes, different flows, and how that gets integrated has an impact on quality of service. Decisions in each of those areas lead to different NoC designs. So from an architectural perspective, you need to decide on one topology, which could be a mesh, ring or tree. The simpler the architecture, the fewer potential deadlocks. But if you do all of this manually, it’s difficult to come up with multiple design possibilities. If you automate it, you can use formal techniques and data analysis to connect all of the pieces.”
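
A hedged sketch of the automated exploration Kumar describes, scoring two of the topology families he mentions on a simple proxy metric (average hop count; a real NoC flow would also weigh deadlock freedom, channel sizing and quality of service):

```python
import itertools

def mesh_hops(n: int) -> float:
    """Average Manhattan hop count between all node pairs on an n x n mesh."""
    nodes = list(itertools.product(range(n), repeat=2))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in pairs) / len(pairs)

def ring_hops(n: int) -> float:
    """Average shortest-path hop count on a ring of n*n nodes."""
    size = n * n
    pairs = [(i, j) for i in range(size) for j in range(size) if i != j]
    return sum(min(abs(i - j), size - abs(i - j)) for i, j in pairs) / len(pairs)

# Rank candidate topologies by the proxy metric instead of hand-picking one.
candidates = {"4x4 mesh": mesh_hops(4), "16-node ring": ring_hops(4)}
for name, hops in sorted(candidates.items(), key=lambda kv: kv[1]):
    print(f"{name}: avg hops = {hops:.2f}")
```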

The machine-learning component in this case is a combination of training data and deductions based upon that data.

“The real driver here is fewer design rules,” Kumar said. “Generally you will hard-code the logic in software to make decisions. As you scale, you have more design rules, which makes updating the design rules an intractable problem. You have hundreds of design rules just for the architecture. What you really need to do is extract the features so you can capture every detail for the user.”

NetSpeed has been leveraging commercially available machine learning tools. eSilicon, in contrast, built its own custom platform based upon its experience with both internally developed and commercial third-party IP.

“The fundamental interaction between supplier and customer is changing,” said Mike Gianfagna, eSilicon’s vice president of marketing. “It’s not working anymore because it’s too complex. There needs to be more collaboration between the system vendor, the IP supplier, the end user and the ASIC supplier. There are multiple dimensions to every architecture and physical design.”

ASML, meanwhile, is working with Cadence and Lam Research to more accurately model optical proximity correction and to minimize edge placement error. Utilizing machine learning models allowed ASML to improve the accuracy of mask, optics, resist and etch models to less than 2nm, said Henk Niesing, director of applications product management at ASML. “We’ve been able to improve patterning through collaboration on design and patterning equipment.”

Machine learning is gaining ground as the best way of dealing with rising complexity, but ironically there is no clear approach to the best architectures, languages or methodologies for developing these machine learning systems. There are success stories in limited applications of this technology, but looked at as a whole, the problems that need to be solved are daunting.

“If you look at embedded vision, that is inherently so noisy and ambiguous that it needs help,” said Cadence Fellow Chris Rowen. “And it’s not just vision. Audio and natural languages have problems, too. But 99% of captured raw data is pixels, and most pixels will not be seen or interpreted by humans. The real value is when you don’t have humans involved, but that requires the development of human cognition technology.”

And how to best achieve that is still a work in progress—a huge project with lots of progress, and still a very long way to go. But as investment continues to pour into this field, both from startups and collaboration among large companies across a wide spectrum of industries, that progress is beginning to accelerate.

Related Stories
Plugging Holes In Machine Learning
Part 2: Short- and long-term solutions to make sure machines behave as expected.
What’s Missing From Machine Learning
Part 1: Teaching a machine how to behave is one thing. Understanding possible flaws after that is quite another.
Building Chips That Can Learn
Machine learning, AI, require more than just power and performance.
What Does An AI Chip Look Like?
As the market for artificial intelligence heats up, so does confusion about how to build these systems.
AI Storm Brewing
The acceleration of artificial intelligence will have big social and business implications.


Hellmut Kohlsdorf says:

Machine learning will have a tremendous impact on job replacement, and linked to that are the principles of state and social financing. Machine learning, still in its infancy, will be capable of replacing not just jobs with low skill requirements, but also many jobs that today are considered secure.
Today, the financing of state functions—including health and social security—is based to a huge degree on taxes taken from salaries.
If more and more jobs, especially those with higher salaries, are replaced by machine learning tools, salaries will be paid to fewer and fewer people.
I live in Germany, and because of this I see the dramatic impact machine learning will have on salaries being taxed as the main source of income for governments and social infrastructure. I am seeing the first steps toward implementing a basic income for people funded by taxes. So in a world where a lot of jobs will be lost to machine learning, funding will have to come from taxes on the functionality that replaces them.
I share the opinion presented in this article that the research, development and implementation has deep pockets, so the speed of advances in machine learning is and will be dramatic!

Andrei Pokrovsky says:

FWIW latency determinism is not that big of an issue with GPUs. There’s typically little variability in kernel execution time as long as data stays in GPU memory.
