Speeding Up AI

Achronix’s CEO explains where the main bottlenecks are and how to eliminate them.


Robert Blake, president and CEO of Achronix, sat down with Semiconductor Engineering to talk about AI, which processors work best where, and different approaches to accelerate performance.

SE: How is AI affecting the FPGA business, given the constant changes in algorithms and the proliferation of AI almost everywhere?

Blake: As we talk to more and more customers deploying new products and services, what’s surprising is just how bullish they are about the opportunity and growth rate. We didn’t think the scope of AI/machine learning capabilities was going to grow as quickly or as broadly.

SE: And there are two sides of this, right? There is AI as a tool to develop AI devices and chips, and then there are AI chips.

Blake: Yes, but there are also things I wouldn’t have expected would be optimized. You’d expect it in the cloud with images and voice, but not necessarily on the radio base-station side, optimizing the radio network for the number of users you can have simultaneously and the quality of their calls. So there’s optimization of many other systems using AI and machine learning. In general, it’s about pattern recognition.

SE: That’s more or less a way of speeding up data by looking for patterns rather than processing individual bits, right?

Blake: Yes, and when you look at the different pieces, taking a data-center approach to solving different issues is, at some level, very sophisticated pattern recognition we would not otherwise be able to do. So how would you optimize the power or the frequency bands or the directional antennas to maximize those channels? The problems are way too complicated for anyone to recognize those patterns. But now we have the capability to analyze massive patterns in operations of various things, and to optimize those things. In some ways, it’s more sophisticated pattern recognition than individuals or groups of individuals would ever be able to conceive.

SE: Is there commonality about how you can apply this? So you may not be just looking at your data. You may be looking at other data, as well, right?

Blake: Yes, and that changes the compute model from you having to predict all of those things. Now you have a lot of information about how a particular pattern problem can be optimized. That can be fed into a convolution and we can optimize from whatever that system is.

SE: Where do FPGAs fit in?

Blake: When you look across all of these problems, you’re still dealing with massive amounts of data. Now we have to do all sorts of new computations with new algorithms that are changing almost on a daily basis. We’re still at the infancy level for this. We still require a massive amount of compute, but it hasn’t settled out yet how that should be done. Can you deploy something that has very high performance and still has enough flexibility? Where the application gets narrowed down for voice recognition for a data center or a call center, it’s sufficiently constrained, so you may be able to build something more tailored for that. But right now that doesn’t exist. It’s still the Wild West. So what hardware should you deploy in the cloud or at the edge in these various aggregation points that gives you the compute uplift, but retains its flexibility? We’re betting on a level of performance while retaining a level of flexibility.

SE: You’re adding another dimension to this, which is how will this evolve over time, rather than what’s the best way of doing this today, right?

Blake: Correct. For some of these things there is a clear indication of what’s going to happen, and for some of them we don’t know. FPGAs have evolved from connectivity blocks. A lot of people still equate FPGAs with a prototype. I need to hook up this to this, and then I’m going to build something that’s fixed. What’s happened now is that, given this new requirement for a new level of computation and flexibility, all of a sudden these devices have morphed from a prototyping type of thing into an acceleration type of thing. That was the fundamental shift that caused us to start building something very different. The roots are still there. The technology is still FPGA-based programmable logic. But this has evolved into a very different animal than what it was 10 years ago.

SE: At what point was there a crossover from the limited benefits of scaling to, ‘Look what else I can do here?’

Blake: Over time we’ve seen a blending of ASIC and FPGA technologies together. Different technologies do different things well. Load-store CPUs have finite performance. They have great amounts of flexibility but finite amounts of performance. They run into the same kinds of problems as other semiconductors, which is that you have a heat barrier for how much power dissipation there is, so you can only run them so fast. That’s why everyone has gone from single core to multi-core. What an FPGA does very well is it enables you to unroll loops in programs and develop a customized pipeline to run those things very fast. Whether that’s 1-bit arithmetic or 3-bit arithmetic or 16-bit floating point or 8-bit integer, it doesn’t matter. That underlies what the technology is very good at. And then, if you look at the computation requirements going forward, architectures have morphed. We’ve recognized what some of these problems will be. So the question is how we can leverage the architectures we’ve built to be very good at the computation problems we’re seeing.
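The loop-unrolling idea Blake mentions can be sketched in a few lines. This is an illustrative Python analogy only, not Achronix code: a CPU executes a multiply-accumulate loop serially, while an FPGA can unroll the same loop into parallel multipliers feeding an adder tree, producing a result every cycle.

```python
def dot_serial(a, b):
    """CPU-style: one multiply-accumulate per loop iteration."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_unrolled(a, b):
    """FPGA-style: all multiplies 'in parallel', reduced by an adder tree."""
    products = [x * y for x, y in zip(a, b)]  # parallel multiplier blocks
    # log2(n)-deep adder tree: pairwise sums until one value remains
    while len(products) > 1:
        products = [products[i] + products[i + 1] if i + 1 < len(products)
                    else products[i]
                    for i in range(0, len(products), 2)]
    return products[0]

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
assert dot_serial(a, b) == dot_unrolled(a, b) == 70
```

Both functions compute the same dot product; the difference an FPGA exploits is that the unrolled form has no serial dependency between the multiplies, so they can all be instantiated as hardware running at once.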

SE: So this computation is looking at strings of computations rather than individual things?

Blake: Yes, and in general what is happening is we’re building very large data sets and then doing computations on them that do that pattern recognition, and that’s happening in a way that’s going to continue to evolve. For a long time everyone was focused on just building a better compute engine. The problem is that you need to look at this holistically. Whether it’s CPUs or GPUs, you need to look at the performance after the operating system and applications are added on top, and what you find is the performance is not very good. When we used to buy dial-up modems or a cell service, at the high level, you would buy 10 megabits per second or a 100-megabit link. But if you look at what actually comes out of the pipe, there’s a big degradation because there are many layers of abstraction that sit on that technology. So on the compute side, if you don’t keep the compute engines working 24 x 7, you don’t get high performance.

SE: Is the key designing from the data out rather than the hardware in?

Blake: Yes, that’s absolutely right. There are three pieces to this. There’s the compute block, which has to be very good. There’s also the data, which is coming from sensors or memory sources. Ethernet is a common pipeline, so you need to think about those interfaces. And then, even after you get it onto a chip, how do you move around that data so it gets to computation blocks efficiently and then it gets dispatched. So you have to look at compute, data transfer and the memory hierarchy interfaces.

SE: The big problem we hear about is getting enough data to keep these chips working full time.

Blake: In general, it’s almost like a production line. If some critical component doesn’t show up, it’s stalled. Even a great pipeline doesn’t help in that case. You’ll find stalls in traditional software in places like caches, but unexpected things can cause stalls, too. It’s like putting a big engine in a car. It doesn’t run faster if you don’t put in the right fuel pump. No matter what, one thing we can be sure about is that things will change. So we have CNNs, and now we have RNNs.

SE: What else are you seeing?

Blake: If you can do pattern recognition with smaller data sets and still get the same results, you will save massively in terms of area and power consumption. That’s a trend we’re also seeing. You have to bake that kind of flexibility into it. If you’re just optimizing for one precision, that doesn’t work. So if you talk to data center customers, they used to care about 16-bit floating point, but that’s not as important anymore. The trend is moving to small precision, which favors FPGAs because they can do 1-, 3- or 8-bit integer as well as 16-bit floating-point arithmetic.
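The precision reduction Blake describes can be made concrete with a simple quantization sketch. This is a generic illustration, not anything specific to Achronix hardware: 32-bit float weights are mapped onto 8-bit integers with a single scale factor, trading a small, bounded error for much smaller storage and arithmetic.

```python
def quantize_int8(weights):
    """Map float weights onto the int8 range [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.8]
q, s = quantize_int8(w)
restored = dequantize(q, s)
# Each restored value stays within one quantization step of the original.
assert all(abs(a - b) <= s for a, b in zip(w, restored))
```

The point of the trend is that for pattern-recognition workloads this bounded error often doesn’t change the final classification, while an 8-bit multiplier costs far less area and power than a 32-bit floating-point one.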

SE: Basically, you’re moving closer to how the brain works, where we have numerous inputs but they aren’t all accurate, right?

Blake: Yes, and if you look at the matrix multiplication that underlies a lot of these, a high percentage of the matrices have zero values in them. In those cases, you don’t want to waste your time doing matrix multiplication, so you can pre-process that data and compress it.

SE: What determines what gets processed where?

Blake: That will keep changing. You have to do it very quickly, so you can’t necessarily run it through a CPU because the data streams are so fast.

SE: What’s the solution?

Blake: The piece that is very new for us is a network on a chip. So we are building a highway above the chip to move things around very quickly. You’ve got massive pipes, a 112G SerDes, PCIe Gen 5, Ethernet, and massive numbers of GDDR memories on the side to feed this. How do you move it all around? In the center, there is an 8 x 10 array that is a logical box. The key is being able to move data sets to any of those points. From a software standpoint, there are basically doors inside the chip, and you can move that data to any of 80 doors.

SE: Any advantage to HBM versus GDDR6?

Blake: The first devices will be GDDR6. Subsequent devices will be HBM. One of the reasons we did GDDR6 is that from a cost standpoint, it’s cheaper. We will be the only FPGA company with support for GDDR6.

SE: Is this going into an edge device?

Blake: Yes, and that will be the sweet spot. But it also can be in the network, so this could be doing packet processing.

SE: And this also could be a fan-out as well as planar?

Blake: Yes. At a high level we focused on three things. There’s a brand new logic structure. We used to have DSP blocks. Now we have a machine learning processor that is very tightly coupled to the memory structures. If you were to do this fractional matrix multiplication, we have an efficient way of doing that. The underlying fabric handles more conventional FPGA arithmetic. On top of that, because of the sheer bandwidth, we put this highway over the top of it to augment what would have been a normal logic structure. That kind of acceleration will be dramatic.

SE: How many of these will be in a server?

Blake: There will typically be two on a form-factor PCI card.

SE: If you scale these or modify them, is there an overhead on how many you put together?

Blake: You can have long-reach SerDes or short-reach SerDes. At that point, you could put down many of them and scale the problem. This is an edge compute play. That’s where this is going to be a game-changer. This is the confluence of things we’ve been working on for the past 10 years.

SE: Where do you see this heading in terms of markets?

Blake: We’ll do an embedded version of an FPGA and license that. In this form, it will be used in data center acceleration, particularly in edge computing.
