Experts at the table, part 2: Who is the real user and how will they program this type of solution?
While it has been possible to pair a CPU and FPGA for quite some time, two things have changed recently. First, the industry has reduced the latency of the connection between them, and second, we now appear to have the killer app for this combination. Semiconductor Engineering sat down to discuss these changes and the state of the toolchain to support this combination with Kent Orthner, system architect for Achronix; Frank Schirrmeister, senior group director for product management in the System & Verification Group at Cadence; and Ellie Burns, marketing director for HLS and low power, and Gordon Allan, Questa product manager, both from Mentor Graphics. Part one can be found here. What follows are excerpts from the conversation.
Burns: We are seeing the desire to take a software algorithm, with nirvana being able to take C software. This is what the industry wants. If we have the low latency and high bandwidth and we know what it is and we know the architecture, we stand a much better chance with the toolchain we have today to be able to take that C algorithm and move it onto the FPGA and verify it.
SE: So the ultimate user is a software person. High-level synthesis (HLS) was never intended for this type of user.
Burns: This is what they want.
Allan: In the high-frequency trading market, you can imagine a guy sitting in Korea and he is the brains behind the operation, except he is not coding software. He is coding the next RTL or HLS algorithm that is going to be downloaded into the chip halfway around the world.
Schirrmeister: The question becomes, is it really RTL? What are these guys using? What about OpenCL?
Burns: There are languages popping up like daisies.
Schirrmeister: If I look at programming models, there was the multicore crisis in 2008. Microsoft wrote a paper about the free lunch being over and the world was coming to an end. CPUs were no longer getting faster, we have a power problem and now the whole deck of cards will break and nobody will be able to get designs out anymore because we can’t parallelize. Unless that happens in the next five minutes, it hasn’t happened, and we are 10 years on from there. So how have we worked around it? It is not that everyone is using OpenCL or alternatives, although that is certainly a trend. It is moving in that direction. But there are many programming models. There was Cuda, OpenMP… There are perhaps 50 of them.
Burns: Even Google has a higher-level language that maps down into OpenCL and Cuda.
Schirrmeister: I always joke that I don’t know what to teach my daughter except Scratch or some of these very high-level languages. Nobody wants to deal with hardware anymore. But OpenCL is a trend, and I am looking at Amazon Web Services (AWS), and the preview blogs I have read say you log in and you attach to a host that has an FPGA and here is Vivado and …
Burns: Did you see the comments on that article? Great hardware stuff, great hardware stuff, and this big section saying they still ask you to do RTL. This is a problem.
Orthner: And those solutions still connect the FPGA over PCI Express. You have all of the driver stacks to deal with.
Schirrmeister: Is it really RTL or OpenCL? There is certainly a higher level of description that is needed.
Orthner: OpenCL is just a language. HLS is a way of taking OpenCL, of taking ANSI C, of taking SystemC, all of these high-level languages, and putting them in the FPGA.
Burns: Yes, you need HLS to be able to take advantage of this. The technology has to mature into something.
Orthner: RTL is not going away. We will see a lot more HLS. We will see a lot more libraries of components that will snap together with standard interfaces. Xilinx went to an AXI approach a few years ago.
Burns: HLS can understand those two things. If you fix the architecture, its job has become a lot easier.
Orthner: And if you look at how many HLS compilers work, under the hood they are using a library of little components that snap together. The more standard interfaces we can get, the more we can get to something like a standard template library (STL) that everyone can use for logic that just snaps together. It is sad that every RTL developer out there seems to…
Burns: Not just snapping together of big things. We have seen some HLS users just try to instantiate things, and that doesn’t give you the flexibility of scheduling or pipelining. It takes about half the power of what you need in HLS, which is to be able to take A + B + C and have the tool figure it out. You need to be at a higher level, and this is the mindset of the software guy. I have an algorithm, and some users in automotive and image processing will make templates. The algorithm goes on the outside, the architecture is in the template, and because they know what it is they just start generating all of it and they can generate the IP very quickly. As we start to understand more of the architectures, I hope that HLS can evolve.
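To illustrate the template idea Burns describes, here is a minimal C++ sketch. It is not taken from any panelist's tool or flow, and every name in it (`StreamingWindow`, `Sum3`, `Max3`, `run`) is hypothetical. The template fixes the architecture, here assumed to be a three-sample sliding window over a stream, while the algorithm is supplied from the outside as a type parameter.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical "architecture" template: a 3-sample sliding window over a stream.
// The algorithm applied to each window is supplied from outside as Kernel.
template <typename Kernel>
class StreamingWindow {
public:
    // Shift in one sample and return the kernel result over the last three.
    // Positions not yet filled by a real sample hold zero.
    int32_t push(int32_t sample) {
        window_[0] = window_[1];
        window_[1] = window_[2];
        window_[2] = sample;
        return Kernel::apply(window_[0], window_[1], window_[2]);
    }

private:
    std::array<int32_t, 3> window_{};  // value-initialized to zeros
};

// Two interchangeable algorithms; the architecture above does not change.
struct Sum3 {
    static int32_t apply(int32_t a, int32_t b, int32_t c) { return a + b + c; }
};

struct Max3 {
    static int32_t apply(int32_t a, int32_t b, int32_t c) {
        int32_t m = a > b ? a : b;
        return m > c ? m : c;
    }
};

// Helper that streams a whole input vector through the chosen kernel.
template <typename Kernel>
std::vector<int32_t> run(const std::vector<int32_t>& in) {
    StreamingWindow<Kernel> w;
    std::vector<int32_t> out;
    out.reserve(in.size());
    for (int32_t s : in) out.push_back(w.push(s));
    return out;
}
```

Swapping `run<Sum3>(samples)` for `run<Max3>(samples)` changes the algorithm without touching the window structure. Whether a given HLS tool would schedule or pipeline such a template well depends on the tool and is not something this sketch demonstrates.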
Allan: At one end there are CPUs and instruction sets, and the FPGA could be on a coprocessor accelerator interface as opposed to being on the processor bus. And you could take that in a direction and have smart software compilers that are able to recognize patterns and retarget them into newly invented instructions, using some flow to accelerate traditional software. At the other extreme, you have a more structural approach. For IoT, when we have a billion devices connected to 5G networks, we don’t need a billion different platforms. We need a small number of platforms and many variants. I see part of the toolchain being the platform, with CPUs preconfigured with kernels for low power and security that are able to deal with the things every IoT device has in common: being able to update themselves on demand without interrupting what they are doing, being implicitly secure, and being able to operate in the lowest power range possible. All of that requires structural design techniques and deciding to leave the piece of the jigsaw puzzle that is fluid, which could be filled by HLS generating some logic.
Orthner: Back to the verification problem. The tools that you describe – when you can get to the point where you write code and it gets accelerated for you when placed on the FPGA, you were talking about how in the verification space—FPGA people tend to just write the code and then put it on the chip and debug in the lab. This moves us more and more into being able to treat software like unit test. I can write my algorithm and I don’t have to get too involved with how it is mapped to the FPGA and I can write all of my software-level unit tests for verification. That would be so much faster than it is today.
Burns: This is what we are seeing and what people want. They want to be able to write the algorithm and test it and this is why C to RTL equivalence checking is important – generate the RTL, make sure it is the same, get it onto the device and rerun the tests. So we are seeing this as being important. The FPGA guys are not going to take up, as hard as we have tried, they are not going to take up an ASIC verification flow. They have not embraced it and the software guys are never going to do it.
Schirrmeister: Exactly. It is actually worse than I thought. The mindsets you have to take into account are the FPGA hardware mindset, the software mindsets which are different and you try to give them a methodologically clean approach to things and you have to provide tremendous value for them to adopt it.
Burns: As an industry, we have to automate a lot of this to make it a reality. Google has their TensorFlow. That is yet another flow, a language that sits above OpenCL, Cuda, etc. Is that where we are heading?
Schirrmeister: The key is that at the end of the day, there is always an attempt to lock you into a revenue stream of sorts. Cuda locks you into NVIDIA. You have the same for OpenCL, which, while open to a specific intent, is just like Verilog and VHDL: there are different levels of support, which may lock you into whichever FPGA you are using.
Allan: AXI is open but it moves everything toward ARM.
Schirrmeister: The one who wins is the one with the smartest hardware abstraction. The one with the most efficient mapping from the higher levels into hardware. The other interesting thing with HLS is that along with new hardware/software co-verification, you also get system-level design and performance analysis capabilities that you were never able to do before. The amount of flexibility you now have to explore the system space and drive that into HLS – in the past you may have had two points, now you have a full Pareto curve. I do not know how this extends to CPU with an attached FPGA. There it is all about the interfaces.
Burns: Yes, you have the bandwidth and latency limitations.
Schirrmeister: How much data can I pump between them, and how do I efficiently share through memory? Do I make them coherent caches between the FPGA and CPU?
Orthner: This is central. If you are dealing with two devices it is very difficult. We are working on Cache Coherent Interconnect for Accelerators (CCIX) to alleviate some of that, but it is still going to be slow.