Supporting CPUs Plus FPGAs

Experts at the table, part 1: What the toolchain looks like today and the different mindsets within those flows.

While it has been possible to pair a CPU and an FPGA for quite some time, two things have changed recently. First, the industry has reduced the latency of the connection between them, and second, we now appear to have the killer app for this combination. Semiconductor Engineering sat down to discuss these changes and the state of the toolchain that supports this combination with Kent Orthner, system architect at Achronix; Frank Schirrmeister, senior group director for product management in the System & Verification Group at Cadence; and Ellie Burns, marketing director for HLS and low power, and Gordon Allan, Questa product manager, both from Mentor Graphics. What follows are excerpts from the conversation.

SE: Until recently, the CPU and the FPGA have been considered as two different application areas with two different flows, two different teams and almost a different thought process. Now we are seeing many applications where these two devices are coming together. What kinds of applications are we seeing this happen in?

Orthner: We are seeing a tremendous amount of acceleration in this area. Consider Amazon cloud, and Microsoft with Azure. I am seeing a lot of people getting interested in convolutional neural networks (CNN). That comes up again and again. At the recent FPGA conference in Monterey, probably half the conference was about that. I see applications such as SQL acceleration for people doing database applications, where you run the SQL through an FPGA on the way to your hard drive and you can do all kinds of filtering. There is a lot of acceleration, a lot of PCI Express applications where people are struggling with latency and throughput over those links.

Allan: FPGA and ASIC technologies are overlapping increasingly in their design flows. There is less and less reliance on a technology-centric flow having one side for FPGAs and the traditional ASIC flow. The way we are approaching our markets is by looking at the end-market segments of the customer base. Some of them are using FPGA technology in new and interesting ways such as SQL, query acceleration, and high-frequency trading algorithm acceleration. We see these kinds of new rapid turn-around technologies as being desirable and a great fit for combining ASIC and FPGA flows.

Burns: We see that too. We see a lot of experimentation, people looking for very high performance and very low power. They will measure power on a CPU or a GPU, and they may need to move to an FPGA or an ASIC. Deep learning, machine vision, all of these kinds of things need improved performance per joule, and people are looking for the most effective way to get it. On an FPGA you have much lower power than a GPU or CPU, so you can ask whether you need to go to an ASIC and what that would look like. We are seeing a drive to understand how to do software acceleration efficiently. From a C-based flow we are starting to see people ask, 'How else are we possibly going to do that, and how do we know ahead of time what the performance per joule is going to be?' Consider CNNs for machine learning in automotive: you cannot put liquid-cooled systems in your car. That is not going to work, so they have come up with lower-power solutions. An NVIDIA box draws about 40 Watts.

Orthner: It really does come down to operations per Joule.

Burns: This is a key measurement. How many giga-operations per joule does this algorithm achieve? How many does it need? And what is my power budget?
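The operations-per-joule budgeting the panelists describe can be made concrete with a back-of-the-envelope calculation. The sketch below uses hypothetical throughput and power numbers (only the ~40 W figure comes from the conversation):

```python
# Back-of-the-envelope energy-efficiency comparison (illustrative numbers).
# Efficiency in operations per joule = throughput (ops/s) / power (W),
# since 1 W = 1 J/s.

def ops_per_joule(throughput_gops: float, power_watts: float) -> float:
    """Giga-operations per joule for a device running at the given power."""
    return throughput_gops / power_watts

# Hypothetical devices running the same CNN inference workload.
gpu_eff = ops_per_joule(throughput_gops=500.0, power_watts=40.0)   # 12.5 Gops/J
fpga_eff = ops_per_joule(throughput_gops=200.0, power_watts=8.0)   # 25.0 Gops/J

# Energy to process one batch that needs 1,000 Gops of compute:
workload_gops = 1000.0
print(f"GPU:  {workload_gops / gpu_eff:.1f} J per batch")   # 80.0 J
print(f"FPGA: {workload_gops / fpga_eff:.1f} J per batch")  # 40.0 J
```

In this (made-up) scenario the GPU is faster per batch but the FPGA finishes the same work on half the energy, which is exactly the tradeoff Burns describes measuring.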

Schirrmeister: There are two components. One is the notion of balancing software performance with hardware performance, and power is a big driver for that. Another driver is latency: how fast can I get a certain task done? Can I outsource some of my algorithm into an FPGA for acceleration? This has to work appropriately with the software. We run into these situations when people ask us to remotely access an FPGA prototype. They ask whether the channel is fast enough to move the data over, accelerate it, and get the results back. Whether that works depends on the application. Things such as video algorithms tend to lend themselves very well to the power tradeoff. But other applications require low latency. They may be able to do a specific compute function 10X faster, but for applications such as accelerated financial trading, it is all about latency. Another aspect is the notion of what happens within the chips.
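The offload decision Schirrmeister outlines comes down to whether the link latency eats the accelerator's speedup. A minimal sketch, with all numbers hypothetical:

```python
# Sketch of the offload break-even: a 10x-faster accelerator only wins
# if the round-trip over the link does not swallow the gain.
# All timing numbers below are hypothetical.

def offload_pays_off(cpu_time_us: float,
                     fpga_time_us: float,
                     link_latency_us: float) -> bool:
    """True if sending the work over the link and back beats the CPU."""
    round_trip = 2 * link_latency_us  # request out, result back
    return fpga_time_us + round_trip < cpu_time_us

# Video frame: long compute, the link latency amortizes -> offload wins.
print(offload_pays_off(cpu_time_us=10_000, fpga_time_us=1_000,
                       link_latency_us=500))   # True

# Trading decision: tiny compute, PCIe-class latency dominates -> stay on CPU.
print(offload_pays_off(cpu_time_us=20, fpga_time_us=2,
                       link_latency_us=50))    # False
```

This is why the later discussion about moving the FPGA onto the same die matters: shrinking `link_latency_us` makes much smaller units of work worth offloading.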

SE: What about design problems for this combination?

Schirrmeister: The design problems surface when you put an FPGA together with the rest of the SoC, which essentially may be an array of processors. You have two issues. First, how do you design that itself, which is one area of focus for our tools. Second, if you have a monolithic thing all on one chip, how do you balance the functions you put into the processor system against what you put into the FPGA fabric? Can you fit the IP into a specific device? Do you have enough available LUTs? Can you map your IP to accelerate what was in the Arm-based subsystem? How do you model this, and how do you make the decision? At the highest level, it may be board plus CPU, or all on a single chip. The thing I find fascinating is that way back in history there was a lot of research into hardware/software co-design, and it is now becoming reality because both the hardware and the software are flexible. We are on the cusp of figuring out new design flows for the kinds of things we thought we could do in the '90s. Now we have them on-chip.

Burns: It is one thing to say that hardware/software co-verification is where all the interesting work is, but what we are seeing is that to truly do any sort of acceleration, the toolchain has to change. We can't do the modeling, then separate the pieces and have them go down separate paths. The idea of the Amazon cloud, of being able to instantly accelerate an algorithm that someone has in software, means the toolchain has to do this automatically. They need a way to implement it quickly, and a way to verify it quickly: make sure it works, make sure it is safe.

Schirrmeister: It is a problem for the FPGA side when it is all integrated, because you face different mindsets. Now everything is flexible. If I make a mistake, it is fixable. For an ASIC, a mistake that goes into tapeout is a pivotal career move. But now I can fix it either in hardware or in software.

Burns: You can fix it but…

Schirrmeister: It changes the mindset. You don’t have such a huge burden hanging over you. You are not as methodologically clean anymore.

Allan: But we want to avoid going back into the days of lab-based verification.

Orthner: I hope people are not just mapping it and turning it on.

Allan: We are redefining the SoC. In the mid-'90s, we defined an SoC as processors plus peripherals on a chip. Then we redefined the SoC to add the firmware as a component; the ability to change and upgrade it added complexity, so we included the firmware in our definition. In the late '90s, we experimented with FPGA plus CPU on the same silicon, and we were ahead of our time. We failed because we did not have the right FPGA architecture, and because the industry was not prepared to adopt this as its time-to-market solution. Now we have mature FPGA technology, either to embed a CPU into an FPGA or to embed the FPGA into an ASIC where it coexists with a CPU. We redefine the SoC to include that flexibility in the logic, and the toolchain has to become fit for purpose. The markets these devices target have concerns about low power, security, reliability, and time to market. The toolchain has to deal with all of them.

Burns: They are changing all the time. You cannot fix the IPs in there anymore.

Orthner: One of the limitations of the previous way of working was that the FPGA sat separately on the board. You get locked into a coarse-grained decision-making process about what goes into each device.

Burns: Because you have a fixed and known bandwidth and latency.

Orthner: The closer you can get to the FPGA core, such as putting it on the same die, the lower the latency, and you can make fine-grained decisions. You can take an algorithm that only takes a couple of microseconds and move it over into the FPGA for acceleration.

Burns: If I look at the latest devices from Xilinx and Intel, they are taking care of the latency and bandwidth, and all of a sudden it is practically nothing. It may not literally be nothing, but it is no longer a PCIe transaction. It is very low latency and high bandwidth.

Orthner: It is no longer the problem.

Burns: Now the problem has become, ‘You have hardware that you don’t have a proper toolchain for.’

Schirrmeister: Based on what happened in the '90s, I want to believe that we are getting there, that we have clean flows where you verify everything ahead of time. Looking at what we experienced in some of those designs, we had virtual platforms and the ability to mix transaction-level models (TLMs) with the processor model. But to developers working on the most recent FPGAs, that flow is foreign. It is an interesting organizational mindset difference.

Burns: For an FPGA designer, I am not sure they do that.

Schirrmeister: That is my point. If there were a flow that let them assess the performance/power tradeoff much more easily, they might adopt it. But there is still the map-it-and-try, lab-based verification mentality, because the design can be changed so easily. The value has to be immense for people to adopt that flow. There are certain markets where…

Orthner: … you don’t want to spend months in a verification cycle.

Schirrmeister: Exactly. You would do that if the cost and the volume legitimized the investment, but it is probably not what FPGA designers have done previously.




  • Maximilian Odendahl

    Burns: “Now the problem has become, ‘You have hardware that you don’t have a proper toolchain for.’”

    This is exactly the problem Silexica is tackling with its SLX Tool Chain, trading off performance vs power, taking both heterogeneous processors/FPGAs and communication architectures into account! Happy to dive into a discussion on our technology.

    • Ellie Burns

      Yes, that would be interesting.

  • Karl Stevens

    There were references to what the FPGA designers are doing with the implication that the tools need to contend with current design approaches — that is “Wrong” with a capital “W”.

    Designers have no choice but to use a pitiful toolchain that throws everything away and starts from scratch, synthesizing and placing every LUT, FF, and wire.

    The data flow and sequencing are critical and where most bugs exist, so the tool chain AND the designers must adopt an approach to implement a flexible data flow for both algorithms and sequencing.

    Not everything happens “at once,” as is often said; otherwise there would be no need for clocks. Nor is there a good reason to define control logic using what looks like a programming language full of if/else and assignment statements.

    FPGAs use LUTs to evaluate Boolean algebra for control and computation. Each LUT is a one-bit-wide memory, and every cell is used whether it contains a ‘1’ or a ‘0’, so synthesis is less critical: it is designed to minimize gates, which do not exist in FPGAs.
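    The point that a LUT is just a small memory can be sketched in a few lines. This is an illustrative model in Python, not any vendor's tooling: the k inputs form an address into a 2^k-entry, one-bit-wide truth table.

    ```python
    # A k-input LUT modeled as a 2**k x 1-bit memory: the inputs form the
    # address, and the stored bits are the function's truth table.

    def make_lut(truth_table: list):
        """Build a LUT from its truth table (one bit per input combination)."""
        assert all(bit in (0, 1) for bit in truth_table)

        def lut(*inputs: int) -> int:
            address = 0
            for bit in inputs:            # pack inputs into an address, MSB first
                address = (address << 1) | bit
            return truth_table[address]

        return lut

    # Truth table for a 2-input XOR, addresses 00, 01, 10, 11:
    xor = make_lut([0, 1, 1, 0])
    print(xor(0, 1), xor(1, 1))  # 1 0
    ```

    Every cell of the table is stored regardless of its value, which is why, as the comment argues, gate minimization matters less here than in standard-cell synthesis.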

    If it is necessary to execute C code, then put a C accelerator on the FPGA instead of trying to synthesize C code.

    If computation algorithms can be accelerated, then application code can also be accelerated without a CPU.