Targeting And Tailoring eFPGAs

Achronix’s CEO zeroes in on new levels of customization, how that plays across different markets, and why embedded programmability is becoming more popular.

Robert Blake, president and CEO of Achronix, sat down with Semiconductor Engineering to discuss what’s changing in the embedded FPGA world, why new levels of customization are so important, and difficulty levels for implementing embedded programmability. What follows are excerpts of that discussion.

SE: There are numerous ways you can go about creating a chip these days, but many of the protocols are still evolving. Some of the markets are new, while others like automotive and industrial are changing pretty dramatically. And for many of them, there is a lot of uncertainty about which protocol—or which version of a protocol—will win. How does that play into the eFPGA world?

Blake: That is exactly the problem that we see. Clearly, there’s a need to get better efficiency in terms of computing, and simply adding more cores is not necessarily going to solve the problem. You can’t make them go any faster because they’re too power hungry. So you need a different platform that’s going to give you some flexibility because the future is very uncertain.

SE: Where are you seeing the growth? Is it discrete or embedded FPGAs?

Blake: You’ll see growth in both. You’ll still see accelerator attach through PCIe or CCIX (Cache Coherent Interconnect for Accelerators) or OpenCAPI (Open Coherent Accelerator Processor Interface)—any of the interface standards that could be used to attach accelerators to a compute environment. When someone starts to think about embedding that technology, they can get an order of magnitude improvement in latency and bandwidth.
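
As a rough illustration of that order-of-magnitude claim, here is a minimal Python sketch comparing round-trip latencies. Both figures are representative assumptions for this illustration, not measurements from any particular product.

```python
# Rough latency budget: off-chip accelerator attach vs. on-die eFPGA.
# Both figures below are representative assumptions, not measured data.

PCIE_ROUND_TRIP_NS = 1000    # host <-> PCIe-attached accelerator, ~1 us
ON_DIE_ROUND_TRIP_NS = 50    # on-die interconnect, tens of ns

speedup = PCIE_ROUND_TRIP_NS / ON_DIE_ROUND_TRIP_NS
print(f"On-die attach cuts round-trip latency by roughly {speedup:.0f}x")
```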

SE: Because it’s all sitting on that same chip?

Blake: Yes. You don’t want to go through a narrow pipe, where you have to stand in line and serialize/deserialize if you are interested in having very fast compute. It’s almost like caches on CPU architectures. The reason these evolved over time is because you want very tight coupling between the CPU structure and the memory structure. The same is true if you’re trying to do acceleration. You don’t want to put accelerators far away from the compute resource. You want those very close so that when you make requests the turnaround time is very fast.

SE: Any feedback in terms of it being difficult to work with an eFPGA?

Blake: Nothing unusual. You see trepidation from anyone trying something new, and that includes people who have never integrated an FPGA before. It’s like integrating a memory block. There’s an interface to it, it’s a custom block, it has input/output ports, and it has timing you have to meet. After a while they start to see that, from an integration standpoint, it has a lot of similarities to the physical implementation of adding memory structures to ASICs. There is the added complexity of a software element, but engineers quickly realize they already know how to use the software tools as a standalone package, and it’s not very different to have them inside. It’s a learning exercise, but it’s still a relatively straightforward process.

SE: What is the ‘aha’ moment when an engineer figures out what they can do with it?

Blake: People are very used to what you can do with a conventional CPU environment. What’s different is that they’re not yet familiar with the kind of 30X or 50X improvement in compute performance that is right there for the taking.

SE: How do you see this rolling out across different markets?

Blake: The spaces in which we see the most interest are any kind of data center compute and also the new class of edge compute. It’s high-performance compute with very low latency, so this can apply to any application that needs closely attached compute. It provides a different level of capability, both in terms of latency and performance, at a very different price point.

SE: Is it data centers or cloud? The whole premise of the cloud is flexibility.

Blake: It’s both. However, the initial deployments are going to be cloud compute resources offered as a service.

SE: Any other markets?

Blake: Yes, wireless infrastructure is another one. If you look at the 5G rollout, there’s a huge amount of uncertainty about how the standard will evolve. Companies are very concerned about building SoCs for this marketplace without some future-proofing. They need flexibility, and an embedded FPGA is the ideal way to do it. It’s an integration play, a cost play and a power consumption play. It’s just a natural fit for that space. A third growth market for embedded FPGAs is networking, and it includes anything involving top-of-rack switching. Anytime you get these multiple 10G links—10/25/100G—the classic compute environment for a CPU can’t keep up. The number of cycles you get from a 3 GHz machine doesn’t match those data rates. You need more flexibility, whether that’s new levels of encryption or deep packet inspection, to offload those functions from the CPU. The acceleration sits between the network interface and the CPU resources to be able to pre-process, screen or encode/decode.
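
The per-packet cycle budget makes the mismatch concrete. A short sketch using standard Ethernet framing numbers (64-byte minimum frames plus 20 bytes of preamble and inter-frame gap on the wire):

```python
# Cycle budget for a 3 GHz core handling 100G at line rate with
# minimum-size packets. Framing constants are standard Ethernet.

LINK_BPS = 100e9          # 100 Gb/s link
WIRE_BYTES = 64 + 20      # min frame + preamble/inter-frame gap
CPU_HZ = 3e9              # 3 GHz core

pps = LINK_BPS / (WIRE_BYTES * 8)      # ~148.8 million packets/s
cycles_per_packet = CPU_HZ / pps       # ~20 cycles
print(f"{pps/1e6:.1f} Mpps leaves ~{cycles_per_packet:.0f} cycles per packet")
# ~20 cycles is nowhere near enough for encryption or deep packet inspection.
```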

SE: How about machine learning, which is more of a horizontal technology?

Blake: That space is very interesting. There’s a lot of interest in GPU architectures for the learning/training phase of machine learning. But if you look at the inference piece of it, that is variable-precision fixed point, which is ideal for FPGAs. There has been a lot of research that says one-bit arithmetic is enough to do voice recognition. If you’re doing image or video processing, some pieces of those networks are 8-bit precision. But some pieces may be less than that and some may be more. So now you have this fundamental problem that the depth and composition of the networks are changing. This is variable-precision arithmetic. How do you build a single architecture that can span those? An FPGA-like architecture can go from 1-bit wide to 32-bit wide and implement anything in between.
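
To make “variable-precision fixed point” concrete, here is a minimal NumPy sketch that quantizes the same weights at different bit widths; the function and values are illustrative, not from any shipping toolchain.

```python
import numpy as np

def quantize(x, bits):
    """Signed fixed-point quantization of values in [-1, 1)."""
    if bits == 1:
        return np.sign(x)                       # binarized, e.g. 1-bit nets
    levels = 2 ** (bits - 1)
    return np.clip(np.round(x * levels), -levels, levels - 1) / levels

rng = np.random.default_rng(0)
weights = rng.uniform(-1, 1, 4)
for bits in (1, 4, 8, 16):                      # widths a fabric can mix freely
    print(f"{bits:>2}-bit:", quantize(weights, bits))
```

On a fixed-width datapath every layer pays for the widest case; a programmable fabric can provision exactly the width each layer needs.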

SE: How granular can you get with all of this?

Blake: We’re introducing a new capability that will enable us to make some dramatic improvements in efficiency. By adding new blocks to the embedded FPGA fabric we’re going to be able to tailor resources inside these blocks. That will make a dramatic impact on die size, power, performance and throughput.

SE: This was always the promise of eFPGAs. What’s different?

Blake: Logic and memory are the staples of accelerators. But if you start to profile different workloads, you find certain things come out as the heavy-lifting pieces. For example, you can do multiply/accumulate using a lot of resources, or you could sharply reduce resource utilization through customization. We’re seeing that companies would like to add more customization inside the embedded FPGA fabric, and we’re opening up what was in the past just our in-house capability.


Fig. 1: The need for hardware accelerators. Source: Achronix

SE: This doesn’t sound easy from a design standpoint. Is there a steep learning curve?

Blake: In the software world, people are familiar with profiling code, looking for the pieces that are used most often, and then optimizing and improving those. What we’re doing is profiling the accelerators. We’re finding that certain functions would benefit from being hardened. We’re going to collaborate with the end customers. Some of our end customers have expertise and know what they’d like to accelerate. This way they can customize around that. In the past, you could change the number of columns and the resources. Now we’re adding another dimension, where you add one or more accelerators that very closely match the applications they want to accelerate.
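
The workflow is analogous to software profiling. A hedged sketch, with invented operation names and counts, of how a profile might flag hardening candidates:

```python
from collections import Counter

# Hypothetical operation trace from profiling a workload; the op names
# and counts are invented for illustration only.
trace = ["mac"] * 5200 + ["cmp"] * 900 + ["shift"] * 450 + ["xor"] * 300

profile = Counter(trace)
total = sum(profile.values())
for op, count in profile.most_common():
    share = count / total
    verdict = "harden" if share > 0.25 else "leave in soft fabric"
    print(f"{op:>6}: {share:6.1%} -> {verdict}")
# The dominant operation (multiply/accumulate here) becomes the
# candidate for a hardened block inside the eFPGA fabric.
```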

SE: Let’s go back to target markets. Are there any other markets you’re targeting?

Blake: Yes, we’re appealing to anyone doing any sort of unstructured text matching. If I’m looking at database acceleration, maybe I’d like to match your name in the database with some attributes and see how many instances there are of that. At the lowest level, there’s a binary string that I’m trying to match. In some cases it’s not a perfect match, so you look for the best match.

SE: Like a Gaussian distribution?

Blake: Yes, and how do you do that? In a programmable architecture you can implement many instances of that and simply run them all in parallel. Those kinds of things don’t necessarily fit very well in a GPU environment because that’s a very, very different workload.
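
A minimal sketch of that best-match idea over binary strings, using Hamming distance. In a programmable fabric each comparison would be its own hardware instance running in parallel, which a sequential loop can only hint at.

```python
def hamming(a: int, b: int) -> int:
    """Count differing bit positions between two binary strings."""
    return bin(a ^ b).count("1")

query = 0b10110010
database = [0b10110011, 0b01100101, 0b10010010, 0b11111111]

# On an FPGA, every entry is scored simultaneously; here we emulate it.
best = min(database, key=lambda entry: hamming(entry, query))
print(f"best match: {best:08b} at distance {hamming(best, query)}")
```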

SE: That’s a big-data type of approach.

Blake: Yes, and that’s why with the edge-based compute piece of it, people are looking at mining that data to see what type of insights they can gain.

SE: What about industrial control and industrial IoT? Is there a fit there as well?

Blake: Yes, there is. That’s a classic space for FPGAs and control algorithms. In some of those spaces, the performance levels fit microcontrollers or CPUs, but in other cases the requirements are real-time enough that you need predictable response. That takes you more into the automotive classes of automation, where you need instantaneous performance that is predictable.

SE: How are you doing in automotive?

Blake: It is a new space for us. But given the fusion of different sensors and the ability to interpret data across many different things—not just radar and cameras, but sound and motion detectors—embedded FPGAs can help you make predictions about what type of control the vehicle needs. That’s going to take a while, though, because there are such stringent requirements in automotive.

SE: What about other markets—how do you pick and choose?

Blake: We have to be quite precise. We’ll choose a careful path to build on the success that we’ve had with previous products. The fundamental markets we’ve got are all related. Separately, automotive companies want to know more about what embedded FPGAs can do. But in that space, there are challenges in building a product and meeting the reliability constraints of those products, which requires us to make investments in those areas.

SE: How about support for the infrastructure, the tools, and the testing of this kind of structure, where so much is use-case dependent?

Blake: If you think of it from a use-model standpoint, if we build any of these fabrics, the path we’re going down is to build an accelerator compiler. We don’t know what the blend of resources will be, but we know that there are certain accelerators that fit the 5G wireless space, certain ones that fit more in the CNN space, and certain things that will fit deep-packet inspection and processing Ethernet streams. We have the ability to construct a variable-size accelerator fabric for those different marketplaces very quickly. Alongside that, the tools piece is critical because you have to deliver the set of tools that enables that fabric to be programmed. But that’s our fundamental business. That’s what we do. From a testing and manufacturing standpoint, we have to be able to test those structures. It turns out that testing FPGAs is not simple to do. But because you have a programmable fabric, you can generate any kind of structure in it, so the ability to do self-test is a huge benefit. As customers embed those functions and need to test them, we supply all the test vectors for manufacturing test.
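
One standard way programmable fabrics exploit that self-test property is to generate pseudo-random stimulus on-chip with an LFSR. A generic illustration of the idea, not a description of Achronix’s actual test methodology:

```python
def lfsr16(seed=0xACE1, taps=(0, 2, 3, 5)):
    """16-bit Fibonacci LFSR (x^16 + x^14 + x^13 + x^11 + 1);
    yields one pseudo-random test pattern per step."""
    state = seed
    while True:
        bit = 0
        for t in taps:
            bit ^= (state >> t) & 1
        state = (state >> 1) | (bit << 15)
        yield state

gen = lfsr16()
patterns = [next(gen) for _ in range(4)]       # stimulus for the fabric
print([f"{p:04x}" for p in patterns])
```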

SE: Where does that hit overhead in terms of performance and power?

Blake: It’s minimal. The overhead is a matter of a few percentage points.

SE: You’re moving into mission-critical markets. Any questions about reliability, given this is a new technology?

Blake: In general, these products are already being used in quite high volumes, meaning millions of units. In those kinds of spaces, there’s an expectation that those products will be up 24/7. There are very high requirements for reliability.

SE: Who’s the biggest competition in machine learning and AI?

Blake: The competition is varied. You obviously have the existing FPGA players. In addition, there are GPUs, which have gained a lot of attention in the learning phases of machine learning. But the larger market is going to be on the inferencing side, and the cost and power consumption of those products is going to be critical. That’s why an embedded solution in those spaces will be attractive.

SE: Have you done any power and performance comparisons of an eFPGA versus an ASIC?

Blake: The comparison you can make is this: In the past, people looked at converting an FPGA to a low-cost ASIC because of the cost savings. In that world we would have said, ‘I’m doing an FPGA design as a prototype, and eventually I’m going to harden that prototype because it’s not going to change anymore. How much can I save by transferring that to an ASIC?’ In some of those cases, if you build one function with no more changes, you can probably find a 5X cost-structure improvement, depending on the complexity. The paradigm shift is that programmable logic, or those kinds of structures that are programmable, was seen as overhead to get you into the marketplace. You never wanted to use it long-term, because once you got into production you didn’t want to change anything anymore. There has been a fundamental change since then. A CPU, DSP or GPU carries the overhead of programmability in the same way an FPGA-like structure does. The new requirement is that it can be continuously reprogrammed.
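
A back-of-envelope version of that conversion economics, with purely hypothetical dollar figures, shows why hardening only pays off when the function truly stops changing:

```python
# FPGA vs. hardened-ASIC break-even volume. All dollar figures are
# hypothetical placeholders, chosen to reflect the ~5X unit-cost gap.

FPGA_UNIT = 50.0          # programmable part, no up-front NRE
ASIC_UNIT = 10.0          # ~5X cheaper per unit once hardened
ASIC_NRE = 2_000_000.0    # one-time masks/engineering cost

break_even = ASIC_NRE / (FPGA_UNIT - ASIC_UNIT)
print(f"The ASIC pays off above {break_even:,.0f} units")
# ...and only if the function never changes again. Continuous
# reprogramming removes that option, which is the paradigm shift.
```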

SE: Any plans to create a platform for this? So basically you can develop your own programmable fabric and someone else can add in their secret sauce or function or whatever.

Blake: We’re still going to build our own silicon. We will build packaged parts that have these acceleration functions, and somebody can use them at the board level. But we’ve also got customers that want it in a package. In those cases, we manufacture a chip, or we license them to manufacture the chip and they do package-level assembly. The third piece of this is to put it on the same die.

SE: What’s the tradeoff?

Blake: In some cases, it may make more sense to integrate at the package level than at the die level. There’s a very complex economic picture about whether to go on-die or in-package. If you want the highest bandwidth and lowest latency, you go on-die. But if you build a large SoC and put an accelerator on it, that provides the economics of a single chip. With a package, you’re making tradeoffs on bandwidth and latency. Depending on what someone wants to solve, we may propose a single die, we may propose a chiplet solution, or we could go with a silicon interposer or organic substrate.


