More Multiply-Accumulate Operations Everywhere

Flex Logix’s CEO points to the growing need for MAC functionality in a variety of new markets.


Geoff Tate, CEO of Flex Logix, sat down with Semiconductor Engineering to talk about how to build programmable edge inferencing chips, embedded FPGAs, where the markets are developing for both, and how the picture will change over the next few years.

SE: What do you have to think about when you’re designing a programmable inferencing chip?

Tate: With a traditional FPGA architecture you have a fully programmable interconnect, but the compute elements are very fine-grained. You have LUTs (lookup tables) and single MACs (multiply-accumulates), and then you can wire them up any way you want. We didn’t go with that approach because we believe these chips will be processing images of 1 to 4 megapixels. With the customers we’re targeting, even the smallest image might be half a megapixel, which is what you see with an ultrasound, for example. When the images are that big, you know you’re going to be doing a lot of MACs in neural networks. We cluster MACs in groups of 64 and can have up to 1,000 of them connected in a ring. A special interconnect moves data between the clusters within the ring. It’s very flexible, but we trade some flexibility for efficiency by clustering the MACs into groups of 64 rather than relying on single MACs.
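The clustering trade-off can be sketched in a few lines. This toy model uses only the 64-MAC cluster size mentioned in the interview; the function names and everything else are hypothetical illustrations, not Flex Logix’s actual hardware:

```python
# Toy sketch of clustered multiply-accumulate (MAC) units.
# The cluster size of 64 comes from the interview; the rest is
# a hypothetical illustration, not the actual NMAX design.

CLUSTER_SIZE = 64

def mac_cluster(weights, activations):
    """One cluster: up to 64 multiply-accumulates reduced to a partial sum."""
    assert len(weights) == len(activations) <= CLUSTER_SIZE
    acc = 0
    for w, a in zip(weights, activations):
        acc += w * a  # one MAC operation
    return acc

def dot_product(weights, activations):
    """Split a long dot product across clusters; a ring interconnect
    would move the partial sums between clusters."""
    total = 0
    for i in range(0, len(weights), CLUSTER_SIZE):
        total += mac_cluster(weights[i:i + CLUSTER_SIZE],
                             activations[i:i + CLUSTER_SIZE])
    return total

print(dot_product([1] * 128, [2] * 128))  # 256
```

The point of the grouping is that a fixed 64-wide reduction can be hardened for efficiency, while the ring keeps the composition of clusters flexible.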

SE: So basically you’ve taken an architectural approach to this problem. How is that working out?

Tate: We’re going to find out over the next year. Nobody really knows what will happen, but our prediction, based on what we’ve seen so far with customers, is that there will be a lot of weird models out there. Inflexible architectures will have a hard time processing models that differ from what they originally planned for. And the diversity of models will be big, so we think our architecture will give us a big advantage. We’ve got one customer whose tests showed our chip was 10 times faster than a competitor’s chip, and our chip costs double-digit dollars versus $2,000 for our competitor’s. It’s also being used for an application that is very different from anything we had anticipated. When that potential customer first showed up on our doorstep, the performance wasn’t as good as it is today. But as our guys studied it, they realized this was a very different model from what they expected. So we thought about it and hooked the components together in a different way. As a result, we were able to get the performance 2.5 times higher. And then we went back in and tuned the compiler.

SE: Flex Logix started out as an eFPGA company. This is a significantly different direction. What’s going on with the eFPGA side of the business?

Tate: We are still doing that. In fact, we’re getting a lot of revenue on the eFPGA side.

SE: Is there a lot of overlap between the eFPGA and the inferencing processor?

Tate: If you look at it from the customer application, FPGAs are programmed using Verilog. Neural networks are programmed using ONNX or TensorFlow Lite deep-learning neural-network models. In that sense, they look totally unrelated. But if you go down into our hardware and look at the details, you’ll see that our inference IP is a highly optimized embedded FPGA. To draw an analogy, the very first FPGAs just had LUTs. There were no MACs. At some point, somebody realized that a lot of their customers were using FPGAs for signal processing, doing multiply-accumulates, so they hardened the multiply-accumulate to get higher performance, reduce silicon area, and give customers more bang for their buck. Today, all FPGAs have multiply-accumulators sprinkled throughout. So Cheng [Wang], my co-founder, observed that you can use the MACs in FPGAs — and there are a lot of MACs in FPGAs, which is why some people use FPGAs for inference — but we could put in a lot more MACs and make them much more efficient. We can cluster 64 MACs with local memory so that we can do a one-dimensional scalar processor, and then hook those clusters up to do matrix multiplies and convolutions of any size. So the hardware guts of this chip have DNA from our embedded FPGA, and about half of our inference IP is in the same blocks that we use for our embedded FPGA. And then we add the hardened clusters of 64 MACs in a ring.
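The idea of building matrix multiplies “of any size” from fixed 64-MAC clusters can be illustrated with a small sketch. The tiling scheme and function names here are assumptions for illustration, not the actual NMAX interconnect:

```python
# Hypothetical sketch: composing a matrix multiply of arbitrary size
# from fixed-width 64-MAC clusters, per the description above.

CLUSTER = 64

def clustered_dot(row, col):
    """Accumulate a dot product in 64-wide chunks, one chunk per cluster.
    In hardware, a ring interconnect would carry the partial sums."""
    total = 0
    for i in range(0, len(row), CLUSTER):
        partial = sum(r * c for r, c in zip(row[i:i + CLUSTER],
                                            col[i:i + CLUSTER]))
        total += partial
    return total

def matmul(A, B):
    """C[i][j] = dot(row i of A, column j of B), each dot product
    tiled across clusters."""
    cols = list(zip(*B))
    return [[clustered_dot(row, col) for col in cols] for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Convolutions reduce to the same primitive, since each output pixel of a convolution is itself a dot product over a weight window.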

SE: Where are you seeing a pickup in your eFPGA business? When they were first introduced, a lot of people were kicking the tires and not really doing much with them.

Tate: To draw an analogy, when I was at Rambus, a lot of people looked at our memory and said it was faster than what they needed. We eventually got there, but our first volume application was Nintendo 64, which was a consumer toy. But it was a very high-volume toy, and that gave us huge credibility and opened up all these other markets we originally planned to go into. Your first adopters are never quite who you expect, but you need a beachhead, and ours has been in the U.S. aerospace market. One-third of the chips that the U.S. aerospace organizations buy are FPGAs, because they need the programmability and the flexibility, and they’re not very high-volume. The problem is that almost all of them are made in Taiwan, and the U.S. government thinks that’s too close to China — and China says Taiwan is part of China. So for assurance of supply, we started doing deals with people like Sandia National Labs and DARPA. We’ve announced projects with other government organizations and aerospace contractors, which are doing a lot of design work now with us, to enable critical chips to be made in U.S. fabs, including GlobalFoundries’ 12nm and 14nm processes. We support those with our technology, and we’re looking at supporting more U.S. fabs. So that’s given us a lot of revenue. And those chips are complex. The design complexity in some cases is hundreds of thousands of LUTs. That’s allowing us to continue to develop tools at the high end of routing capability. We’ve recently announced commercial customers. The two that we can talk about are Morningcore in China, which is a subsidiary of Datang, a big Chinese telecom company, and Dialog, which announced plans to use this in association with mixed-signal chips. They already have mixed-signal programmable chips, but they get more programmability by using our technology. And we’ve got a lot more activity that isn’t public. So the commercial side is starting to grow. The aerospace side is already paying the bills.

SE: On the aerospace side, for a long time those were pretty basic designs developed at older nodes. How sophisticated are these chips getting?

Tate: Just like on the commercial side, not every application needs the bleeding-edge processes. I can only talk about what’s public, but for Sandia National Labs, one of our first customers, we provided embedded FPGA IP for their 180nm fab, which they own and operate. We’ve talked publicly about Boeing using [GlobalFoundries’ 14nm process]. That’s made in New York, outside Albany, using a finFET process, so it’s pretty state-of-the-art. There are other fabs in the United States that do 90nm and 65nm. It depends on what the customer is trying to do, whether that’s more advanced signal processing, artificial intelligence, or whatever else drives the process choice.

SE: As you move into the edge with inferencing, what are you finding in terms of understanding the software? AI, by nature, is software-defined hardware, but it can be improved further with an iterative process between hardware and software.

Tate: Our software is where things really diverge. The software we have for embedded FPGAs is totally different on the user end from the software we have for inference products. When we get to place-and-route, we use the same software, but the customer never sees what’s going on underneath. Neural network models are very high-level. Through some simple operator calls they invoke hundreds of billions of computations, whereas Verilog and RTL are very low-level, like microcode or assembly language. With neural networks, we take care of all of the memory mapping and memory movement. We keep things at a very high level for the user.

SE: Looking out a few years, where do you see the new opportunities?

Tate: One market we’re just exploring now is signal processing. FPGAs are used a lot for signal processing in things like radios and base stations, and the U.S. government uses a lot of FPGAs. We have customers using FPGAs, and customers that are interested in inference. When we go in and talk with them, the ones interested in signal processing look at our NMAX (neural inferencing engine) and say, ‘Hey, look at all the multiply-accumulators in there. There are way more multiply-accumulators per square millimeter than in an FPGA. Could I use this for signal processing?’ Well, they can’t take their existing RTL that works on an FPGA and run it on our NMAX. That wouldn’t work. But what we have been exploring is showing how FIR (finite impulse response) filters run extremely well on our NMAX IP. We can do FIR filters at throughputs as high as or higher than the most expensive FPGAs, but do it in a couple of dozen square millimeters.
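FIR filters map naturally onto MAC-heavy fabrics because each output sample is nothing but multiply-accumulates. A minimal reference sketch (the tap values are arbitrary illustrations):

```python
# A FIR filter: y[n] = sum over k of taps[k] * x[n-k].
# Each output sample costs one multiply-accumulate per tap, which is
# why FIR filtering maps so directly onto an array of MAC units.

def fir(samples, taps):
    """Direct-form FIR filter over a finite input (zero initial state)."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * samples[n - k]  # one MAC per tap
        out.append(acc)
    return out

# A simple 3-tap filter
print(fir([1, 2, 3, 4], [1, 2, 1]))  # [1, 4, 8, 12]
```

On a clustered-MAC fabric, the inner loop over taps is exactly the kind of fixed-width reduction the hardware accelerates; longer filters just span more clusters.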

SE: So that would push you heavily into the communications space, right?

Tate: On the commercial side, the interest would be in communications. But we’ve seen tester companies and aerospace companies do a lot of signal processing. This is not for our X1 chip, which has the NMAX IP inside of it. The X1 chip talks to the outside world using PCIe, which is a block-oriented transfer bus for processors. What a DSP person would want would be more like SerDes — streams of data coming in to minimize latency.

SE: How many more knobs are there for you to turn here to improve the performance or reduce the power consumption? Have you utilized everything yet, or is there more coming in the future?

Tate: There are architectural improvements ahead. There are things we can do with our architecture to make signal processing run faster, but we’d have to change the current architecture. And for any given architecture, the way to cut power is to go to more advanced process nodes like 7, 5, and 3nm. If you move from TSMC 16nm to 7nm, the power should be cut roughly in half and the performance should go up about 20%. Now, the cost of the masks goes up, and bringing the chip to market gets more expensive. But that’s how you cut power for a given level of throughput, in addition to any architectural tricks. We’ll be working on both.
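Taken together, those round numbers imply roughly a 2.4x gain in throughput per watt from the node move alone. A quick back-of-the-envelope check, using only the figures quoted above (not measured data):

```python
# Back-of-the-envelope check of the node-scaling claim:
# ~half the power and ~20% more performance moving TSMC 16nm -> 7nm.
# These ratios are the round numbers from the interview.

power_ratio = 0.5   # 7nm power relative to 16nm, same workload
perf_ratio  = 1.2   # 7nm throughput relative to 16nm

perf_per_watt_gain = perf_ratio / power_ratio
print(f"throughput/W improves ~{perf_per_watt_gain:.1f}x")  # ~2.4x
```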

SE: How about 5G?

Tate: There are multiple sides to 5G. There are base stations, and there’s the radio. The base station architecture is the same everywhere in the world except when you get to the spectrum, because governments have assigned different spectrum to different carriers. As you get toward the radio part of the base station, that’s where you need programmability and where people use a lot of FPGAs. We hear from some of those players that, for various reasons, using FPGAs adds a lot of overhead. You basically have to have these huge banks of SerDes to get data in and out. If you could integrate that into an ASIC, it would be far more power-efficient, which is important because they’re limited on power. So at some point we see embedded FPGAs going into base stations. But on the signal processing side, we also see interest in our NMAX for 5G signal processing. This is just early discussions at this point, though.

SE: Are medical instruments a long-term play for you?

Tate: We’re looking at ultrasound, MRI, CT scan and X-ray technology. I met with one major manufacturer that makes three out of four of those, and all of them are shipping now with AI algorithms. When we go talk with everyone on the edge, what we hear is, ‘We can do this, but we want to do more.’ What they’re able to do right now is better than nothing. It’s useful stuff. But they are all ravenous for more compute capability within their constraints of power and cost. If we can provide twice as much throughput for the same power and the same cost, that’s great. But they want even more. So it’s going to take multiple generations to catch up to what people want, and by then we’ll probably have better neural network models. This is like the early days of the PC era. People used to debate why anyone would need more than a 10MHz PC.
