Supporting CPUs Plus FPGAs

Experts at the table, part 2: Who is the real user and how will they program this type of solution?


While it has been possible to pair a CPU and FPGA for quite some time, two things have changed recently. First, the industry has reduced the latency of the connection between them, and second, we now appear to have the killer app for this combination. Semiconductor Engineering sat down to discuss these changes and the state of the tool chain to support this combination with Kent Orthner, system architect for Achronix; Frank Schirrmeister, senior group director for product management in the System & Verification Group at Cadence; Ellie Burns, marketing director for HLS and low power at Mentor Graphics; and Gordon Allan, Questa product manager at Mentor Graphics. Part one can be found here. What follows are excerpts from the conversation.

SE: How are people going to program this type of system? Will they use RTL, SystemC or some other programming language?

Burns: We are seeing the desire to take a software algorithm, with nirvana being the ability to take C software. This is what the industry wants. If we have the low latency and high bandwidth, and we know what it is and we know the architecture, we stand a much better chance with the toolchain we have today to be able to take that C algorithm, move it onto the FPGA and verify it.

SE: So the ultimate user is a software person. High-level synthesis (HLS) was never intended for this type of user.

Burns: This is what they want.

Allan: In the high-frequency trading market, you can imagine a guy sitting in Korea who is the brains behind the operation, except he is not coding software. He is coding the next RTL or HLS algorithm that is going to be downloaded into the chip halfway around the world.

Schirrmeister: The question becomes, is it really RTL? What are these guys using? What about OpenCL?

Burns: There are languages popping up like daisies.

Schirrmeister: If I look at programming models, there was the multicore crisis in 2008. Microsoft wrote a paper about the free lunch being over and the world was coming to an end. CPUs were no longer getting faster, we have a power problem and now the whole deck of cards will break and nobody will be able to get designs out anymore because we can’t parallelize. Unless that happens in the next five minutes, it hasn’t happened, and we are 10 years on from there. So how have we worked around it? It is not that everyone is using OpenCL or alternatives, although that is certainly a trend. It is moving in that direction. But there are many programming models. There was CUDA, OpenMP… There are perhaps 50 of them.

Burns: Even Google has a higher-level language that maps down into OpenCL and CUDA.

Schirrmeister: I always joke that I don’t know what to teach my daughter except Scratch or some of these very high-level languages. Nobody wants to deal with hardware anymore. But OpenCL is a trend and I am looking at Amazon Web Services (AWS) and the preview blogs I have read say you log in and you attach to a host that has an FPGA and here is Vivado and …

Burns: Did you see the comments on that article? Great hardware stuff, great hardware stuff, and this big section saying they still ask you to do RTL. This is a problem.

Orthner: And those solutions still connect the FPGA over PCI Express. You have all of the driver stacks to deal with.

Schirrmeister: Is it really RTL or OpenCL? There is certainly a higher level of description that is needed.

Orthner: OpenCL is just a language. HLS is a way of taking OpenCL, of taking ANSI C, of taking SystemC, all of these high-level languages, and putting them in the FPGA.

Burns: Yes, you need HLS to be able to take advantage of this. The technology has to mature into something.

Orthner: RTL is not going away. We will see a lot more HLS. We will see a lot more libraries of components that will snap together with standard interfaces. Xilinx went to an AXI approach a few years ago.

Burns: HLS can understand those two things. If you fix the architecture, its job has become a lot easier.

Orthner: And if you look at how many HLS compilers work, under the hood they are using a library of little components that snap together. The more standard interfaces we can get, the more we can get to something like a standard library (STL) that everyone can use for logic that just snaps together. It is sad that every RTL developer out there seems to…

Burns: Not just snapping together of big things. We have seen some HLS users just try to instantiate things, and that doesn’t give you the flexibility of scheduling or pipelining. It takes away about half of the power you need from HLS, which is to be able to take A + B + C and have the tool figure it out. You need to be at a higher level and this is the mindset of the software guy. I have an algorithm, and some users in automotive and image processing will make templates. The algorithm goes on the outside, the architecture is in the template and because they know what it is they just start generating all of it and they can generate the IP very quickly. As we start to understand more of the architectures, I hope that HLS can evolve.
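
To make the scheduling point concrete, here is a minimal Python sketch of resource-constrained scheduling for a sum such as A + B + C. It is a toy model only, not how any particular HLS tool works. The function name `schedule_sum` and the assumption that every adder completes in one cycle are hypothetical choices made for illustration.

```python
def schedule_sum(operands, adders):
    """Schedule a sum of `operands` onto `adders` one-cycle adders.

    Returns a list of cycles. Each cycle is a list of
    (left, right, result) additions issued in that cycle.
    Assumption: every adder produces its result after one cycle.
    """
    if adders < 1:
        raise ValueError("need at least one adder")
    ready = list(operands)   # values available at the start of the cycle
    cycles = []
    temp = 0                 # counter for naming intermediate results
    while len(ready) > 1:
        issued = []
        produced = []
        # Pair up ready values until adders or operands run out.
        while len(ready) >= 2 and len(issued) < adders:
            left, right = ready.pop(0), ready.pop(0)
            result = f"t{temp}"
            temp += 1
            issued.append((left, right, result))
            produced.append(result)
        # Results only become usable in the next cycle.
        ready.extend(produced)
        cycles.append(issued)
    return cycles


if __name__ == "__main__":
    for n_adders in (1, 2):
        plan = schedule_sum(["A", "B", "C", "D"], n_adders)
        print(n_adders, "adder(s):", len(plan), "cycles", plan)
```

The same source expression yields a different schedule when the resource budget changes. In this toy model, four operands take three cycles with one adder and two cycles with two. A hand-instantiated adder chain fixes that choice up front.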

Allan: At one end there are CPUs and instruction sets and the FPGA could be on a coprocessor accelerator interface as opposed to being on the processor bus. And you could take that in a direction and have smart software compilers that are able to recognize patterns and retarget them into newly invented instructions using some flow to accelerate traditional software. At the other extreme, you have a more structural approach which for IoT, when we have a billion devices connected to 5G networks, we don’t need a billion different platforms. We need a small number of platforms and many variants. I see part of the toolchain being the platform with CPUs preconfigured with kernels for low power and security preconfigured that are able to deal with things in common with every IoT device. Being able to update themselves on demand without interrupting what they are doing, being able to be implicitly secure and being able to operate in the lowest power range possible. All of that requires structural design techniques and deciding to leave the piece of the jigsaw puzzle that is fluid, which could be filled by HLS generating some logic.

Orthner: Back to the verification problem. With the tools that you describe, you get to the point where you write code and it gets accelerated for you when placed on the FPGA. You were talking about how, in the verification space, FPGA people tend to just write the code, put it on the chip and debug in the lab. This moves us more and more toward treating verification like software unit testing. I can write my algorithm, I don’t have to get too involved with how it is mapped to the FPGA, and I can write all of my software-level unit tests for verification. That would be so much faster than it is today.

Burns: This is what we are seeing and what people want. They want to be able to write the algorithm and test it and this is why C to RTL equivalence checking is important – generate the RTL, make sure it is the same, get it onto the device and rerun the tests. So we are seeing this as being important. The FPGA guys are not going to take up, as hard as we have tried, they are not going to take up an ASIC verification flow. They have not embraced it and the software guys are never going to do it.
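
The "write the algorithm, test it, then rerun the tests" flow described here can be sketched in Python as one set of test vectors run against more than one implementation. This is an illustration under stated assumptions, not a real flow. `moving_sum_reference`, `moving_sum_hw_model` and `run_suite` are hypothetical names, and the hardware model is a plain Python stand-in for generated RTL or an on-device run.

```python
def moving_sum_reference(samples, window):
    """Golden algorithm: sum of the last `window` samples at each step."""
    return [sum(samples[max(0, i - window + 1): i + 1])
            for i in range(len(samples))]


def moving_sum_hw_model(samples, window, width=16):
    """Stand-in for generated hardware (an assumption for illustration):
    a running accumulator that wraps at `width` bits like a register."""
    mask = (1 << width) - 1
    acc = 0
    out = []
    for i, sample in enumerate(samples):
        acc = (acc + sample) & mask
        if i >= window:
            # Drop the sample that just left the window.
            acc = (acc - samples[i - window]) & mask
        out.append(acc)
    return out


def run_suite(implementation):
    """Run the same test vectors against any implementation.

    Returns a list of failures; an empty list means all vectors passed.
    """
    vectors = [
        ([1, 2, 3, 4, 5], 3),
        ([0, 0, 0], 2),
        ([10, 20, 30], 1),
    ]
    failures = []
    for samples, window in vectors:
        expected = moving_sum_reference(samples, window)
        actual = implementation(samples, window)
        if actual != expected:
            failures.append((samples, window, expected, actual))
    return failures


if __name__ == "__main__":
    print("reference:", run_suite(moving_sum_reference))
    print("hw model: ", run_suite(moving_sum_hw_model))
```

The suite is written once at the algorithm level and then pointed at each implementation in turn. A vector large enough to overflow the 16-bit accumulator would expose a mismatch between the two models, which is the kind of difference such a rerun is meant to catch.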

Schirrmeister: Exactly. It is actually worse than I thought. The mindsets you have to take into account are the FPGA hardware mindset, the software mindsets which are different and you try to give them a methodologically clean approach to things and you have to provide tremendous value for them to adopt it.

Burns: As an industry, we have to automate a lot of this to make it a reality. Google has their TensorFlow. That is yet another flow, a language that sits above OpenCL, CUDA, etc. Is that where we are heading?

Schirrmeister: The key is that at the end of the day, there is always an attempt to lock you into a revenue stream of sorts. CUDA locks you into NVIDIA. You have the same for OpenCL, which, while open in intent, is just like Verilog and VHDL: there are different levels of support, which may lock you into the FPGA you are using.

Allan: AXI is open but it moves everything toward ARM.

Schirrmeister: The one who wins is the one with the smartest hardware abstraction. The one with the most efficient mapping from the higher levels into hardware. The other interesting thing with HLS is that along with new hardware/software co-verification, you also get system-level design and performance analysis capabilities that you were never able to do before. The amount of flexibility you now have to explore the system space and drive that into HLS – in the past you may have had two points, now you have a full Pareto curve. I do not know how this extends to CPU with an attached FPGA. There it is all about the interfaces.
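
The move from "two points" to a full Pareto curve can be illustrated with a short Python sketch that filters a set of candidate implementations down to the non-dominated ones. The design names and the area and latency figures are invented for illustration and do not come from any tool or device. `pareto_front` is a hypothetical helper.

```python
def pareto_front(points):
    """Return the design points not dominated by any other point.

    Each point is (name, area, latency); lower is better for both.
    A point is dominated if another point is no worse on both axes
    and strictly better on at least one.
    """
    front = []
    for name, area, latency in points:
        dominated = any(
            (a <= area and lat <= latency) and (a < area or lat < latency)
            for _, a, lat in points
        )
        if not dominated:
            front.append((name, area, latency))
    # Sort by area so the curve reads from smallest to largest design.
    return sorted(front, key=lambda p: p[1])


if __name__ == "__main__":
    # Hypothetical results from exploring unroll and pipeline settings.
    candidates = [
        ("unroll1", 100, 40),
        ("unroll2", 180, 22),
        ("unroll2_nopipe", 200, 30),
        ("unroll4", 350, 12),
        ("unroll8", 700, 12),
    ]
    for point in pareto_front(candidates):
        print(point)
```

In this made-up data set, `unroll2_nopipe` and `unroll8` drop out because another candidate is at least as good on both axes. The remaining three points form the trade-off curve a designer would choose from.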

Burns: Yes, you have the bandwidth and latency limitations.

Schirrmeister: How much data can I pump between them and how do I efficiently share through memory. Do I make them coherent caches between the FPGA and CPU?

Orthner: This is central. If you are dealing with two devices it is very difficult. We are working on Cache Coherent Interconnect for Accelerators (CCIX) to alleviate some of that, but it is still going to be slow.
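
The bandwidth and latency limits raised in this exchange can be put into a back-of-the-envelope Python model of when offloading to the FPGA pays off. All numbers in the usage example are hypothetical placeholders, not measurements of PCI Express, CCIX or any device. The model assumes the result is the same size as the input and that transfers do not overlap with computation.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one buffer: fixed link latency plus streaming time."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def offload_pays_off(num_bytes, cpu_time_s, fpga_time_s,
                     latency_s, bandwidth_bytes_per_s):
    """True if send + FPGA compute + receive beats staying on the CPU.

    Assumptions (for illustration only): the result is as large as the
    input, and transfers are not overlapped with computation.
    """
    round_trip = 2 * transfer_time(num_bytes, latency_s,
                                   bandwidth_bytes_per_s)
    return round_trip + fpga_time_s < cpu_time_s


if __name__ == "__main__":
    link = dict(latency_s=1e-6, bandwidth_bytes_per_s=8e9)  # placeholders
    # Large block: transfer cost is amortized over a lot of work.
    print(offload_pays_off(1_000_000, cpu_time_s=2e-3,
                           fpga_time_s=2e-4, **link))
    # Tiny block: the fixed latency dominates and the CPU wins.
    print(offload_pays_off(64, cpu_time_s=2e-7,
                           fpga_time_s=2e-8, **link))
```

Even with a tenfold compute speedup in both cases, the small transfer loses because the fixed link latency exceeds the time the CPU needs to do the work itself. That is why the interface dominates the discussion.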


  • snehasish

    I’d like to bring to your notice a paper* published at High Performance Computer Architecture (HPCA) 2017 which outlines a methodology to extract frequently executed code paths to target a CPU-FPGA heterogeneous system [1]. In this work, a fully automated tool chain is described which uses compiler-aided profiling and analyses to partition arbitrary C/C++ programs into binaries which can be deployed to CPU-FPGA SoCs such as the Altera Cyclone V. The tool chain uses the LLVM compiler framework and is available as open source software [2].

    * — Disclosure: I am the first author.


  • Lars Asplund

    “The FPGA guys are not going to take up, as hard as we have tried, they are not going to take up an ASIC verification flow. They have not embraced it and the software guys are never going to do it.”

    What’s your analysis of this? Is it the FPGA people not knowing what’s good for them? Is it something wrong with the ASIC flow? Is the ASIC flow ill suited for FPGA development?

    • Karl Stevens

      @Lars: I think FPGA designs have been smaller and creating a new configuration bit stream to do fixes or changes was pretty quick and easy. The incentive to do exhaustive verification is not as great as with ASIC.

    • soyAnarchisto

      The mom and pops have not embraced it – which is the majority of designers – but serious players certainly have. And if you are a platform developer for cloud infrastructure – you certainly are – and you are certainly using IP which has undergone every bit as rigorous a verification path as in any ASIC. This Panel is missing some serious expertise from the FPGA vendors they are discussing. Vivado brought ASIC-class tool flows to FPGA designers – but the ASIC designers still have jobs – until they embrace HLS you will not see large shifts in methodology. The initial Amazon AWS flows use RTL – but anybody can see that it will not be this way for very long.

    • Lars Asplund

      @disqus_qZOOSRsiqY @soyAnarchisto

      There is certainly room for improvement in the verification area, but what I see among those striving to improve and those that already have “matured” is that they are looking more at software approaches. Modern software verification is very focused on a short feedback cycle. Once you think a piece of code should work you want to test that in seconds or minutes. Remove bugs, design flaws and misunderstandings as soon as possible to avoid further harm. This implies that:

      – Verification starts with the designer, who is capable of finding most bugs. Verification at higher levels is needed but it also kills the short feedback cycle, so the majority of bugs must be found by the designer/team. Creating a separate expert team to do “all” verification is not an option with this approach.
      – Testing must be highly automated regardless of whether it’s done in a simulated environment or on target hardware. At some point simulations are too slow and it’s more efficient to build the FPGA and run full-speed testing on the target hardware. The ability to continuously build and test on the target hardware is a MAJOR difference between the FPGA and ASIC flows. The FPGA flow is not driven by the fear of losing one’s job.

      This approach is successfully used by companies producing complex, high-volume, high-availability, and safety related products. They are players very serious about verification.

      Where I work we run both simulated and target test suites many times a day on newly developed code. We write the software-based target tests ourselves but get the reference input data and expected output from the application specialists. Very few bugs leave our room and the feedback cycle is very short.

  • Karl Stevens

    “You need to be at a higher level and this is the mindset of the software guy. I have an algorithm, and some users in automotive and image processing will make templates. The algorithm goes on the outside, the architecture is in the template and because they know what it is they just start generating all of it and they can generate the IP very quickly.”

    And it is the mindset of the system engineer/architect (me for 30 years).
    There are 3 fundamental things:
    1) Which component/module of the system will do the function.
    2) What is to be done in terms of functional operations.
    3) The location of the data if it is not local.
    “Address”, “Command”, “Data” in general terms.

    Systems are modular, both HW IP and SW IP. Object Oriented Programming is totally modular and classes can function as either. (HW vs SW trade-offs).

    Verilog came about because it was pure magic that HDL could be simulated! Then because Verilog could be SIMULATED the next step was to BUILD hardware using Verilog. Folks, the purpose of the HDL that existed BEFORE Verilog was to build hardware! (circular thinking)

    And HW designers were forced to use Verilog for design entry. And the tool chains were created. And the Verilog language was created by nonprofessional programmers who thought if the syntax looked like C everything would be wonderful.

    If anyone is serious, take a look at the Roslyn/Mono/C# Syntax API that exposes the conditions, types (variables), and sequences for the C language. Oh, and the compiler does NOT synthesize away things like the unhelpful Verilog compiler does. I can put things in as the design evolves and use them later. Certainly at build time unused resources should be “synthesized away”, but not during design entry.

    The IDE actually has a Debugger and Editor that are actually helpful during design entry.

    “How much data can I pump between them and how do I efficiently share through memory. Do I make them coherent caches between the FPGA and CPU?”

    Heterogeneous systems are the answer. Cache and shared memory don’t cut it. If the FPGA is receiving data, do the DSP there, then put the processed data block in memory, etc.