Machine Learning’s Limits, part 3: Which processor type is the best for training and inferencing, and why are there so many companies trying to build new processors specifically for machine learning and AI?
Semiconductor Engineering sat down with Rob Aitken, an Arm fellow; Raik Brinkmann, CEO of OneSpin Solutions; Patrick Soheili, vice president of business and corporate development at eSilicon; and Chris Rowen, CEO of Babblelabs. What follows are excerpts of that conversation. To view part one, click here. Part two is here.
SE: Is the industry’s knowledge of machine learning keeping up with the pace of development?
Rowen: It’s clear that more theory will help us understand what is really possible and what kinds of network designs will be better than others. At the same time, many of our biggest technological advances have come when deployment got well ahead of theory. We didn’t develop the theory of thermodynamics so we could build steam engines. We built a lot of steam engines, and then said we needed to understand them better. It’s similar with airplanes versus aeronautics. It was only at a later stage that we began looking at how to make this systematically good. It wasn’t a matter of, ‘Stop, let’s wait for theory to catch up.’
Brinkmann: Theory comes from deployment, not from someone saying this is important.
Aitken: The engineering bias is always, ‘Let’s build it now and figure it out later.’ That is the right way to proceed in this space.
Soheili: It’s certainly easier.
Rowen: I just came back from a technical conference where half the speakers thought the real measure of progress is how many equations you can shoehorn into a paper. That is quite independent of how much understanding you have.
Brinkmann: There are differences between Russia and the United States. When people tried to achieve supersonic flight, the first parties were going up in planes and then diving down to reach the speed of sound. Both parties found the wings were flying off the planes, and unfortunately people lost their lives over this. But the Russians went back to the drawing board to look at the theory, and they found that as planes approached Mach 1 the problem stemmed from overlapping shock waves, so they developed wings that stayed out of the way of those waves. The Americans tried to make the wings stronger and stronger, but that didn’t work, so eventually they had to take a different approach. When engineering doesn’t work anymore, some theory may be necessary to overcome a problem.
SE: What’s the best chip architecture to run machine learning algorithms? GPUs seem to have won the training side, but what happens on the inferencing side? A lot of this will be done at the edge rather than in a data center.
Soheili: I don’t think the story is over on the training side with the GPU, even though the market seems to think so. There is a lot of speculation about the future and the future-proofness of GPUs that allows them to be a good vehicle for training. You need to retain a certain amount of programmability and future-proofness in an architecture so that when you deploy these systems you don’t box yourself into a corner. But having said that, every one of the big companies working in this field agrees that training, as important as it is, is very, very expensive. To do it efficiently, because there is so much of a computational requirement, the more optimized and ASIC-like your approach, the better off you’re going to be from the standpoint of speed of learning and cost of learning. GPUs are convenient, but they’re a means to an end. What’s needed is something very specific to what it has to do, what equations it has to solve, and what matrix multiplication it has to do, versus a generic CPU or GPU or FPU.
Rowen: It’s abundantly clear this problem has very idiosyncratic characteristics. Neural network training and inferencing are very highly parallel—probably more parallel than almost any practical problem we’ve encountered in the history of computing. What is remarkable is that, beyond the degree of parallelism, the ability to use one hardware structure across many different types of problems is also very large. You can say, ‘I’m doing speech.’ What’s the right architecture for speech? ‘It’s this.’ And if you’re doing vision, lo and behold, it’s the same architecture. Now you’re predicting what movie you should look at or which search results to show. It’s the same hardware, as well. This neural network structure is very much like the development of the microprocessor in the sense that microprocessors are good for pretty much anything. You don’t have to say, ‘I need a different microprocessor if I’m running marketing modeling versus payroll.’ That universality is leading to an explosion of innovation. Lots of people are trying lots of different things, and we expect to see lots of innovation on the hardware side. One of the reasons there are 50 companies designing new platforms is that almost every architect looks at this problem and says, ‘If that’s all it is, I can make this run 100X faster than it does on a general-purpose machine.’ All of these teams have discovered they can run at 100X, and so you’re getting all of these rival solutions. 100X is the minimum ante.
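A rough way to see that universality (a plain NumPy sketch with invented layer sizes, not any particular production network) is that speech, vision, and recommendation layers all reduce to the same batched matrix multiply, and every output element can be computed independently:

```python
# Minimal sketch: whatever the application, the inner loop of a neural network
# layer is the same batched matrix multiply, and each of the batch*out_features
# results is independent of the others -- which is where the parallelism comes from.
import numpy as np

def dense_layer(activations, weights):
    # activations: (batch, in_features); weights: (in_features, out_features)
    return np.maximum(activations @ weights, 0.0)  # matmul + ReLU

rng = np.random.default_rng(0)
speech  = dense_layer(rng.standard_normal((32, 512)),  rng.standard_normal((512, 1024)))
vision  = dense_layer(rng.standard_normal((32, 2048)), rng.standard_normal((2048, 1000)))
ranking = dense_layer(rng.standard_normal((32, 256)),  rng.standard_normal((256, 64)))
# Different domains, different shapes -- same kernel, same hardware structure.
```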
Soheili: And therein lies the opportunity and the obstacle. If you are solving for one problem, and one type of problem only, there is one solution and one type of solution. Once you try to do many things at the same time, you have to go to a general-purpose processor like a GPU, and you have to keep your options open.
Aitken: There are a couple other layers of complexity. Convolutional nets are basically matrix multiply the majority of the time. But the questions of which matrices, how sparse they are, and what transforms you can do on the data to make them sparser are where the 100X is coming from. There’s still the exact same tradeoff that has always existed, which is how much flexibility do I want to change this thing in the future versus how good is it going to be on my current workload.
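As a concrete illustration of “convolutional nets are basically matrix multiply,” here is a toy NumPy sketch of the standard im2col lowering; the shapes are illustrative assumptions, not a description of any particular accelerator:

```python
# A 2-D convolution rewritten as one matrix multiply (im2col lowering).
import numpy as np

def conv2d_as_matmul(image, kernels):
    # image: (H, W, C_in); kernels: (K, K, C_in, C_out); stride 1, no padding
    H, W, C_in = image.shape
    K, _, _, C_out = kernels.shape
    out_h, out_w = H - K + 1, W - K + 1
    # im2col: every K x K x C_in patch becomes one row of a big matrix
    patches = np.stack([
        image[i:i + K, j:j + K, :].ravel()
        for i in range(out_h) for j in range(out_w)
    ])                                        # (out_h*out_w, K*K*C_in)
    weights = kernels.reshape(-1, C_out)      # (K*K*C_in, C_out)
    return (patches @ weights).reshape(out_h, out_w, C_out)

img = np.random.default_rng(1).standard_normal((8, 8, 3))
ker = np.random.default_rng(2).standard_normal((3, 3, 3, 4))
out = conv2d_as_matmul(img, ker)              # shape (6, 6, 4)
# If many weights are pruned to zero, 'weights' is sparse and hardware can skip
# work -- which is where much of the architecture-specific 100X hunting happens.
```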
SE: Isn’t it also how accurate?
Aitken: That fits in there, but that’s almost a given in some of these situations. If I’ve trained this net and I’m willing to implement it in silicon to make it screamingly fast, there are ways to do that where accuracy isn’t an issue, because the exact same accuracy you have in software is what you have in the hardware implementation. The more interesting question is, ‘Next week I’ve decided to add three more layers to this thing, or I want to collapse it. Now what do I do?’
Rowen: In the grand scheme of things, these are second-order effects. But certainly, neural networks will continue to evolve. The more programmability you have in whatever that solution is, so it can adapt to small or moderate amounts of change that are required of the neural network, the better. So you do face a classic tradeoff with efficiency.
Brinkmann: And not just on the hardware side.
Rowen: But there is a greater cluster of problems that is pretty close together in terms of their hardware requirements. That’s why you do see all of this specialized hardware. There are people building more parallel vector extensions, DSPs, specialized acceleration units.
Aitken: They’re all processors, and they all exist on a spectrum. You may have a programmable CPU on one side.
Soheili: It’s RISC vs. CISC all the way through.
Aitken: Yes, and you have a GPU over here and this Tensor processor in the middle. They’re all solving a very specific problem, and each of the hardware solutions is optimized to do a specific function well. Even if you characterize the machine learning inferencing problems, there’s a set of them that work really nicely on CPUs, there’s a set that work really nicely on GPUs, and there’s a set that is in between.
Soheili: And it depends on which aspect you want to drive. Are you driving throughput? Are you driving latency? Are you driving power or cost? Are you driving two of those? For any one of those, you have multiple axes where you can create a more efficient machine. But you’re throwing out your future-proofness.
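A back-of-the-envelope sketch of that tradeoff (all numbers here are invented for illustration) shows how one knob, batch size, trades latency against throughput:

```python
# Toy model: batching raises throughput but also raises per-request latency,
# so the "best" design depends on which axis you are driving.
def inference_stats(batch, macs_per_sample=1e9, macs_per_sec=1e12, overhead_s=1e-3):
    # overhead_s models a fixed per-batch cost (kernel launch, memory traffic, etc.)
    latency = overhead_s + batch * macs_per_sample / macs_per_sec
    throughput = batch / latency          # samples per second
    return latency, throughput

for b in (1, 8, 64):
    lat, tput = inference_stats(b)
    print(f"batch={b:3d}  latency={lat*1e3:6.2f} ms  throughput={tput:8.0f} samples/s")
```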
Rowen: There’s a key aspect that is an important enabler for semiconductor innovation, which is that the way in which the problems are being described is quite abstract. You have some high-level neural network structure, which captures all of the topology of the network and which can be mapped to all sorts of things. That’s very different from a problem that is only available in a form already optimized for x86 or a GPU. The fact that neural networks are described at a high level opens up so many opportunities for innovation. It actively encourages people to try new parallel structures.
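A sketch of that abstraction, with an invented graph format and hypothetical lowering passes standing in for something like ONNX or a framework graph on one side and real compiler backends on the other:

```python
# The network is just a topology; each hardware target supplies its own lowering.
network = [
    {"op": "conv2d", "kernel": 3, "in_ch": 3, "out_ch": 64},
    {"op": "relu"},
    {"op": "matmul", "in": 64 * 32 * 32, "out": 10},
]

def lower_for_gpu(net):
    # Hypothetical backend: map every op to a generic batched-GEMM/elementwise kernel.
    return [f"gemm_kernel<{layer['op']}>" for layer in net]

def lower_for_systolic_array(net):
    # Hypothetical backend: tile matmul-like ops onto a fixed 16x16 MAC array,
    # leave elementwise ops to a small companion vector unit.
    plan = []
    for layer in net:
        if layer["op"] in ("conv2d", "matmul"):
            plan.append(f"tile_16x16<{layer['op']}>")
        else:
            plan.append(f"vector_unit<{layer['op']}>")
    return plan

print(lower_for_gpu(network))
print(lower_for_systolic_array(network))
```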
Brinkmann: One of the big challenges for anyone trying a new architecture is finding a good way of mapping a general description of a neural network into their hardware. The more complexity, the more difficult it will become. If you look at FPGA synthesis, the input is generally RTL, which is what we’ve used in ASICs for a long time. But mapping RTL to FPGAs is a much more complex problem because you have to deal with a lot of restrictions. You have specialized blocks, DSPs, block RAM and registers, and you also want to make use of those. The same will happen if you have a very refined architecture for machine learning. It will become a huge challenge to target it properly. The software will be one of the big challenges.
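A toy sketch of that mapping problem (the resource names and counts are invented; real FPGA or accelerator mapping involves placement, routing, and scheduling far beyond this) is simply checking whether an abstract network fits the fixed resources a refined architecture exposes:

```python
# Hypothetical fixed resources of a refined ML accelerator.
RESOURCES = {"mac_tiles": 256, "on_chip_kib": 2048}

def fits_on_chip(layers, resources=RESOURCES):
    # layers: list of (name, weight_kib, parallel_tiles_wanted)
    tiles_used = sum(t for _, _, t in layers)
    kib_used = sum(k for _, k, _ in layers)
    ok = (tiles_used <= resources["mac_tiles"]
          and kib_used <= resources["on_chip_kib"])
    return ok, {"tiles": tiles_used, "kib": kib_used}

print(fits_on_chip([("conv1", 64, 32), ("conv2", 512, 128), ("fc", 1024, 64)]))
```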