FPGA Design Tradeoffs Getting Tougher

As chips grow in size, optimizing performance and power requires a range of new design options and methodology changes.


FPGAs are getting larger, more complex, and significantly harder to verify and debug.

In the past, FPGAs were considered a relatively quick and simple way to get to market before committing to the cost and time of developing an ASIC. But today, both FPGAs and eFPGAs are being used in the most demanding applications, including cloud computing, AI, machine learning, and deep learning. In some cases, they are being combined with an ASIC or some other type of application-specific processor or accelerator inside a chip, a package or a system. As a result, requirements for effective power, performance, and area (PPA) are every bit as strict as for ASICs and full-custom chips, and the tradeoffs are equally complicated and often intertwined.

“For SoCs with an FPGA, there are several approaches,” said Stuart Clubb, product marketing manager at Mentor, a Siemens Business. “There’s the ASIC team that is building an SoC and adding an embedded FPGA into the fabric for something that’s programmable — invariably hanging off of a bus and used as some kind of programmable accelerator that they don’t quite know what they’re going to do with yet, or which may be changed. For them, the rigors of the ASIC flow are more commonly adopted.”

The second approach is to use an FPGA as a separate chip alongside the ASIC. Data needs to be moved off-chip and back on-chip using one or more high-speed buses that the FPGA vendors provide. “But that’s just a communication mechanism rather than anything about ‘performance’ because it’s just really about moving data at that point.”

A third approach is to embed a processor inside an FPGA. Xilinx has done this with its Zynq 7000 series, which pairs hard Arm processor cores with the programmable fabric on the same die, as well as with MicroBlaze, a soft processor core.

Regardless of the approach, FPGAs are physically getting bigger, and so are the challenges associated with that growth.

“It’s harder to debug something that’s bigger,” Clubb said. “The FPGA vendors are trying to introduce things to alleviate that with probing inside, etc., but all the problems still remain of the unpredictability of routing. It has been said that with an FPGA you weren’t paying for logic, you were paying for the routing to be able to use the logic. Unfortunately, while the vendors do make great strides and great claims, invariably what we see is that routing, usability and predictability are still problematic. Your perfect and lovely RTL for your ASIC is probably going to be pretty poor in an FPGA, especially for FPGA prototyping, and in some cases it may not even work, especially if you want to prototype anywhere near at speed. With what’s going on in 5G and machine learning, what you might choose to implement, especially for 5G radio, needs to be massively over-pipelined in the FPGA to prove that conceptually your algorithm and all of your magic and your secret sauce is actually going to work. If you were to take exactly that same RTL and put it in your ASIC, it’s going to be massively inefficient.”

What works best
As with any complex design, there are a number of choices that need to be made upfront.

“To achieve the highest performance, my first answer would be to use the fastest process node possible,” said Geoff Tate, CEO of embedded FPGA provider Flex Logix. “But in reality, when we talk to customers, they’ve already usually fixed on a process node when they come to meet with us because with any chip design, people certainly want a faster chip rather than a slower chip. But they do have other considerations — cost, time to market, IP availability, all these kinds of issues. So they usually tell us, ‘We’re probably going to use TSMC 28 or maybe SMIC 28.’ Once you pick the process, there can be variations of a node. For example, with TSMC, if you look at their 16nm node, they now have at least five variations.”

The No. 1 task for design teams to achieve more performance is to write their Verilog well.

“It’s just like writing processor code,” said Tate. “You can have two people write a program, they can both work, but somebody can write the code and it runs 50% faster than the next person. That’s really up to them. There’s not much we can do to help them. One of the common things we see with most of our customers using embedded FPGAs is that they have not done FPGA designs before. So they’ll tend to take a bunch of RTL they developed for their hard-wired chips, and they’ll dump it in the embedded FPGA and say it runs too slow. They need to remember that with an FPGA, every programmable logic element has a flip flop on the output, so the flip flop is free. To get high performance, the Verilog must be modified when moving from hard-wired ASIC to FPGA to put in more pipelining stages. The more design teams invest in optimizing the Verilog, the more performance they’ll get.”
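
The idea can be sketched in a few lines of Verilog. The module names and bit widths below are hypothetical, but they show the difference between ASIC-style RTL that lets a whole expression settle in one clock cycle and an FPGA-friendly version that parks intermediate results in the flip-flops the logic elements already provide.

```verilog
// Hypothetical sketch: the same datapath written ASIC-style and
// re-pipelined for an FPGA. Names and widths are illustrative only.

// ASIC-style: the full expression settles in a single clock cycle.
module datapath_flat (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [34:0] y
);
    always @(posedge clk)
        y <= (a + b) * (c + d) + (a * d);   // one long combinational path
endmodule

// FPGA-friendly: intermediate results land in the "free" flip-flops,
// so each register-to-register path is much shorter.
module datapath_pipelined (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [34:0] y
);
    reg [16:0] sum_ab, sum_cd;
    reg [31:0] prod_ad, prod_ad_q;
    reg [33:0] prod_sums;
    always @(posedge clk) begin
        // stage 1: the adds and the independent multiply
        sum_ab  <= a + b;
        sum_cd  <= c + d;
        prod_ad <= a * d;
        // stage 2: multiply the registered sums; delay-match the other product
        prod_sums <= sum_ab * sum_cd;
        prod_ad_q <= prod_ad;
        // stage 3: final addition
        y <= prod_sums + prod_ad_q;
    end
endmodule
```

The pipelined version adds two cycles of latency, but each stage’s path is far shorter, which is where the clock-rate gain in an FPGA comes from.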

A hard-wired ASIC is designed for a certain clock frequency, so when that design moves to an embedded FPGA there should be fewer levels of logic between flip-flops, which in practice means adding pipeline registers. This is easier said than done, and there isn’t a button to press to make it happen. However, macros can help in embedded FPGA designs, especially where large portions of the design repeat one block over and over again, such as for encryption or Bitcoin.

“A common request is a 64 x 64 bit multiplier,” Tate said. “If you write the Verilog and you place-and-route it, you’ll get a certain level of performance. If an engineering team says, ‘I’m going to use 256 copies of the 64 x 64 multiply,’ we can create a macro, place it, and route it so that it uses less silicon area. Everything’s closer together, and we force this block to always be done a certain way. That can take significantly less area and run at somewhat higher speed. It is something that’s done on a customer-by-customer basis if they identify a block that’s used repeatedly. It’s the equivalent of writing an assembler code subroutine in a C program. You don’t want to write assembler code if you don’t have to, but if it’s something that has a huge impact on performance, it might be worth the investment.”
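
A rough Verilog sketch of that kind of repeated block might look like the following. The module names, the two-stage pipeline and the packed input buses are assumptions for illustration, not Flex Logix’s actual macro flow; the point is that a single 64 x 64 multiplier is written once and instantiated many times, so one optimized, pre-placed copy can be replicated across the array.

```verilog
// Hypothetical sketch: one pipelined 64 x 64 multiplier, instantiated
// 256 times so it can be treated as a repeated, pre-placed macro.
module mult64x64 (
    input  wire         clk,
    input  wire [63:0]  a, b,
    output reg  [127:0] p
);
    reg [127:0] p_raw;
    always @(posedge clk) begin
        p_raw <= a * b;   // first pipeline stage
        p     <= p_raw;   // second stage eases routing timing
    end
endmodule

// 256 identical copies, e.g. for a hashing or crypto datapath.
module mult_array #(parameter N = 256) (
    input  wire              clk,
    input  wire [N*64-1:0]   a_bus, b_bus,
    output wire [N*128-1:0]  p_bus
);
    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : g_mult
            mult64x64 u_mult (
                .clk (clk),
                .a   (a_bus[i*64 +: 64]),
                .b   (b_bus[i*64 +: 64]),
                .p   (p_bus[i*128 +: 128])
            );
        end
    endgenerate
endmodule
```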

Partitioning
In general, the first step to optimizing FPGA/eFPGA performance is to figure out what works best where. Some things work better on a standard processor than an FPGA, while for others an FPGA is at least as good, if not better.

“You do this so when you look at the fabric, it fits very nicely for high data throughput, large parallelism, unrolling all the math functions, and doing everything in one clock cycle as opposed to a bunch of serial ones,” said Joe Mallett, senior product marketing manager at Synopsys. “When you look at the architecture, the first split that you have to make is to determine what part runs where, and you typically do that by the type of workload. Is it something that can be easily put in the fabric and potentially run at a lower speed but much broader? Is it something that’s going to take advantage of the DSP functions very nicely? When you look at FPGAs, you typically see very large bandwidth, heavy DSP math functions, and memory-intensive workloads. For example, if you’re working on something large like a MAC table inside of 100 Gigabit Ethernet, or video that uses a lot of line buffers with math processing right next to it, or radio applications where you’re doing a whole bunch of multiply accumulate, add, subtract kind of functions, and trying to figure out waveforms — those fit very nicely for FPGAs. Getting the performance out of it is yet another challenge, because you’re forever balancing how fast you’re going to run it versus how much area you’re going to take to do it in. If you want to slow it down and burn less power on each clock cycle, you may use more area, which unfortunately is using more power on the other side.”
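
That balance between clock rate, area and power shows up even in a tiny, hypothetical multiply-accumulate example: a serial version that consumes one sample per cycle versus an unrolled version that takes four samples per cycle using roughly four times the multiplier resources. The widths and names below are illustrative only.

```verilog
// Hypothetical sketch of the unrolling tradeoff.

// Serial MAC: minimal area, one sample per clock cycle.
module mac_serial (
    input  wire        clk, rst,
    input  wire [15:0] sample, coeff,
    output reg  [39:0] acc
);
    always @(posedge clk)
        if (rst) acc <= 40'd0;
        else     acc <= acc + (sample * coeff);
endmodule

// Unrolled MAC: four multipliers in parallel, trading area for throughput.
module mac_unrolled4 (
    input  wire        clk, rst,
    input  wire [63:0] samples, coeffs,   // four packed 16-bit values each
    output reg  [39:0] acc
);
    wire [31:0] p0 = samples[15:0]  * coeffs[15:0];
    wire [31:0] p1 = samples[31:16] * coeffs[31:16];
    wire [31:0] p2 = samples[47:32] * coeffs[47:32];
    wire [31:0] p3 = samples[63:48] * coeffs[63:48];
    always @(posedge clk)
        if (rst) acc <= 40'd0;
        else     acc <= acc + p0 + p1 + p2 + p3;
endmodule
```

The unrolled version finishes the same work in a quarter of the cycles, but it burns more area and, as Mallett notes, area and per-cycle power pull in opposite directions.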

Even though FPGAs have been commercially available since the mid-1980s, engineers are still fighting with basic constraints.

“This is what influences the most, and it’s the hardest for designers to get right,” Mallett said. “However, there are lots of things that we do in the tools to try and help with that. One of the first things is that, say, somebody dumps their RTL in and they run it through synthesis. They look at the report and see a whole bunch of 1 MHz clocks that aren’t constrained correctly because it just defaults to something that’s easily identifiable as, ‘You know, this isn’t right, that clock should be running at 100 MHz.’ It’s really easy to just dump everything into the tools, run it through synthesis, evaluate what comes out the other end, and look at the log files and see what happens.”

FPGA synthesis tool providers have added modes and features to help designers quickly find all the ‘I wrote the RTL wrong’ errors, the ‘I missed a semicolon’ mistakes and other language-type errors, and to work through constraint-based errors as well, he said.

FPGA complexity also shows up in the number of IP blocks being used. “Over the last 10 or 15 years, the number of IP blocks has grown from 10 or 20, to 100 or 150. That brings the ability for each one of those blocks to potentially work in different non-synchronous clock domains, and that brings a level of complexity. Then, all the different hardened interfaces that you have to deal with — that has increased the complexity, as well. The sheer size is the most obvious one, because the larger it is, the more you can pack in there. All of those add up to a level of complexity that brings challenges to the designer, and the tools have evolved over time to help with those,” Mallett said.

FPGAs on the edge
One of the real attractions for FPGAs is that they can be used in applications where the technology and markets are still immature. Being able to program functionality in the field in hardware is better than having to write a series of software patches to an ASIC, and while there is a performance and power overhead in FPGAs compared with an ASIC, there also is huge value in being able to adapt to last-minute changes in protocols and algorithms after a device is designed, debugged and manufactured.

This is particularly important with technologies such as 5G, assisted and autonomous driving, AI and anything at the edge.

“Edge computation is going to be a big play,” said Robert Blake, president and CEO of Achronix. “The fundamentals are all there. We know what all the base building blocks are and can figure out how to efficiently move data around in whatever formats. But you need to pay attention to the memory hierarchy of how you move the data the least distance to get it to the computation. These are fundamentals to how to get more efficient computing. You used to think of this as, ‘The box is the most important.’ Now, it’s the system of systems that are interacting. The flexibility that is going to be required everywhere is going to be huge. This is a complete fundamental shift that’s happening fairly quietly.”

It also steers the market, at least for the foreseeable future, strongly in the direction of FPGAs and embedded FPGAs. The argument for eFPGAs is that they can be architected into an ASIC or some other complex chip, adding programmability as a safeguard without sacrificing the performance or low power of an ASIC.

“Once you get to the concept of embedded FPGAs, the delivery of it monolithically or embedded is in my mind the packaging problem,” said Blake. “The piece that is crystal clear is that when you look at the cost of semiconductors—we built little ones, then medium sized ones, and then built big ones—the cost structure goes up. If I want to add an embedded FPGA to a chip, that’s great because it’s still in line with the cost. But if you want to add that capability later in a design, it will cost significantly more.”

Changing methodologies
These are not simple FPGA test-chip designs, though. Along with the more complex designs, the methodologies being deployed to develop and debug these chips are changing significantly.

“In the past, it used to be that you might just get somebody that writes RTL, dumps it into the synthesis, puts it on the board, tests it, and that’s it,” said Synopsys’ Mallett. “Complexity is now to the point where it has to be simulated, and you have to think about debug before you put it on the chip. You also have to start thinking about some broader areas of methodology.”

Increasingly, FPGA designers are adopting more ASIC-like methodologies where they run the designs through synthesis first, perform debug up front, maybe using some verification IP because they don’t know what the protocol should be, Mallett observed. “They’ll set up the test benches appropriately, then they’ll run it through the synthesis engine. They may even be doing some fault simulation or some fault injection if they’re doing high-reliability type applications. Then they debug while it’s running on the chip, as well as correlating that from the chip all the way back to RTL to help with the debug. So when they look at that, it brings in the simulator, synthesis, debug, analysis. These are things the ASIC and SoC guys have solved over the years, and continue to drive. The FPGA world is taking advantage of that now.”
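
As a minimal illustration of that up-front verification step, the hypothetical self-checking testbench below exercises the 64 x 64 multiplier sketched earlier; a production flow would layer UVM, constrained-random stimulus and coverage on top of this kind of skeleton.

```verilog
// Hypothetical self-checking testbench for the mult64x64 sketch above.
`timescale 1ns/1ps
module tb_mult64x64;
    reg          clk = 0;
    reg  [63:0]  a, b;
    wire [127:0] p;

    mult64x64 dut (.clk(clk), .a(a), .b(b), .p(p));

    always #5 clk = ~clk;   // 100 MHz clock

    integer i;
    initial begin
        a = 0; b = 0;
        for (i = 0; i < 100; i = i + 1) begin
            @(posedge clk); #1;              // drive just after the clock edge
            a = {$random, $random};
            b = {$random, $random};
            repeat (2) @(posedge clk); #1;   // wait out the two pipeline stages
            if (p !== a * b)
                $display("MISMATCH: %h * %h -> %h at %0t", a, b, p, $time);
        end
        $display("test done");
        $finish;
    end
endmodule
```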

This requires a mindset change for the FPGA world, however.

“Designers may worry about possible bugs in the implementation flow and they may be hesitant to enable all the optimizations, but these are needed to meet PPA goals,” said Sasa Stamenkovic, senior field applications engineer at OneSpin Solutions.

What can also be helpful is formal sequential equivalence checking of the source RTL design against the FPGA implementation to allay concerns about possible bugs in the implementation flow. “The RTL can be verified exhaustively against the post-synthesis netlist, the placed-and-routed netlist, and even the bitstream that programs the device,” Stamenkovic said. “With equivalence checking in place, the most aggressive FPGA optimizations can be deployed with full confidence, satisfying the most challenging design requirements. Equivalence checking can detect not only implementation errors, but also any hardware Trojans or other unexpected functionality inserted during the implementation process. This ability is critical to establish trust in FPGAs and eFPGAs used for security-critical applications such as autonomous vehicles, military/aerospace, and medical electronics.”

It also involves combining the know-how of both ASIC and FPGA teams, which were largely separate in the past.

“It used to be that the FPGA team was looked down upon by the ASIC team, which has more to do with the cost of failure than anything else,” said Mentor’s Clubb. “It’s not really that much different now, especially with the size of FPGAs. FPGAs today are so large that even 5 to 10 years ago they would have been a very large ASIC project. But they may not necessarily have the same mindset, especially on verification and the rigors of ASIC design. For example, in talking to one ASIC customer, they’re using two clock domains. One clock domain is half the frequency of the other. They invariably will tell you that that is a separate clock domain, and you need to handle the clock-domain crossing. On the other hand, an FPGA designer will just say, ‘I don’t need to bother with that. It will probably work.’ Ten years ago, when the clock networks were pretty rigid, there was a PLL to keep things in sync and you could probably get away with that. But then you spend a lot of time debugging stuff on a board and you wonder why it falls over Thursday morning.”
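
The textbook fix for the single-bit case Clubb describes is a two-flip-flop synchronizer in the destination clock domain. The sketch below is generic; the ASYNC_REG attribute is a Xilinx-style placement hint and other vendors use different attributes, while multi-bit buses need a handshake or an asynchronous FIFO rather than this simple structure.

```verilog
// Hypothetical sketch: two-flop synchronizer for a single-bit signal
// crossing into another clock domain.
module sync_2ff (
    input  wire clk_dst,   // destination-domain clock
    input  wire rst_n,
    input  wire d_async,   // signal launched from another clock domain
    output wire q_sync
);
    (* ASYNC_REG = "TRUE" *)   // vendor hint: keep these registers together
    reg meta, sync_ff;
    always @(posedge clk_dst or negedge rst_n)
        if (!rst_n) begin
            meta    <= 1'b0;
            sync_ff <= 1'b0;
        end else begin
            meta    <= d_async;   // may go metastable
            sync_ff <= meta;      // settled copy used in the clk_dst domain
        end
    assign q_sync = sync_ff;
endmodule
```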

That’s where verification comes into play, according to Clubb. “We’ve seen a lot more FPGA companies start to adopt a much more rigorous verification methodology, including UVM, constrained random, really trying to make sure that they do simulate the heck out of the RTL, because the cost of debugging on a board is no longer just putting a scope across some pins and watching a waveform. It’s about bringing that more rigorous ASIC mindset. If it doesn’t work on the FPGA, you go debug it, you figure it out, you throw some probes in, you do some more place-and-route, and then you just blow a new bitstream. The perceived cost of failure is not your million-dollar masks or screaming about how you can do a metal ECO and save some money. It’s unnoticeable. More of the ASIC mindset needs to come into the FPGA world, because if the plan is to debug the design on the board, they’re doing it wrong.”

—Ed Sperling contributed to this report.


