Basics Of Embedded FPGA Acceleration

Why programmability is essential for speeding up SoC performance.


Making a chip run faster is no longer guaranteed by shrinking features or moving to a different manufacturing process. It now requires a fundamental change in the architecture of the chip itself.

The days of the single processor, or even a single multi-core processor, are gone. The focus has shifted to different kinds of processors for different kinds of data, along with many different protocols and I/O schemes, some of which are still being developed.

Embedded FPGAs are a new way of adding speed with flexibility. In the past, a common strategy was to embed a processor within an FPGA. It turns out the reverse is a much more efficient solution to the problem: embedding an FPGA within a chip, where it can accelerate a variety of operations and keep the chip from becoming obsolete by the time it reaches production. In a variety of new markets such as machine learning, AI and automotive, programmability is the best approach.

So at the most basic level, what are the benefits of an eFPGA? And what kinds of issues do design teams need to consider?

As Achronix System Architect Kent Orthner pointed out in this video, when looking to accelerate functionality using a discrete FPGA chip, the biggest issue that designers face is latency. The most fundamental limiting factor is the communication between a CPU and an FPGA across a relatively narrow and slow PCIe interface. Even products that are advertised as having low-latency interconnects have latencies in excess of 1 μs. Real-world applications (for example, accelerating a Linux application) in fact incur around 15 μs of latency. Discrete systems also duplicate a substantial amount of work, because data has to be written to and read from memory and is typically transferred between two sets of DDR memory as part of the process.

These factors limit the degree of acceleration that can be achieved with a discrete FPGA. For a fairly typical algorithm, simple transactions between a CPU and an FPGA could easily incur up to 25 ms of latency. Batching a large number of operations could help reduce this latency, but only by a factor of 2.5. With an eFPGA inside an SoC, the ability to share the DDR memory and cache hierarchy cuts the time needed to move data from the CPU to the eFPGA and back again to about 10 ns, which is about 2,500,000 times faster.
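The arithmetic behind these figures can be checked with a short sketch. The latency constants come from the paragraph above. The `net_speedup` function and the 1 ms, 10× kernel are illustrative assumptions that are not in the original. They use a deliberately simple model in which offloaded time is the accelerated compute time plus one round trip.

```python
# Latency figures quoted in the text, held as integer nanoseconds.
DISCRETE_NS = 25_000_000  # up to 25 ms per simple CPU <-> discrete FPGA transaction
BATCH_FACTOR = 2.5        # best-case latency reduction from batching
EMBEDDED_NS = 10          # about 10 ns with shared DDR memory and cache hierarchy


def latency_ratio(slow_ns: float, fast_ns: float) -> float:
    """Return how many times shorter fast_ns is than slow_ns."""
    return slow_ns / fast_ns


def net_speedup(cpu_ns: float, kernel_speedup: float, round_trip_ns: float) -> float:
    """Assumed model: offloaded time = cpu_ns / kernel_speedup + round_trip_ns."""
    return cpu_ns / (cpu_ns / kernel_speedup + round_trip_ns)


batched_ns = DISCRETE_NS / BATCH_FACTOR
unbatched_ratio = latency_ratio(DISCRETE_NS, EMBEDDED_NS)
batched_ratio = latency_ratio(batched_ns, EMBEDDED_NS)

# Hypothetical kernel (not from the article): 1 ms on the CPU, 10x faster in fabric.
discrete_net = net_speedup(1_000_000, 10, DISCRETE_NS)
embedded_net = net_speedup(1_000_000, 10, EMBEDDED_NS)

print(f"unbatched latency ratio: {unbatched_ratio:,.0f}x")
print(f"batched latency ratio:   {batched_ratio:,.0f}x")
print(f"net speedup, discrete:   {discrete_net:.3f}x")
print(f"net speedup, embedded:   {embedded_net:.3f}x")
```

Under this assumed model, the hypothetical kernel ends up slower than the CPU when offloaded across the discrete link, because the round trip dominates. With the embedded round trip it keeps almost all of its 10× gain.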

An eFPGA can also vastly shorten FPGA configuration time. Whereas a discrete FPGA might use a serial interface to an EPROM or an 8-bit wide processor interface, an on-chip Achronix eFPGA can be connected via a 128-bit wide AXI interface running at on-chip interconnect frequencies. This high-speed connectivity results in better than a 16× improvement in configuration time when compared to an 8-bit interface running at 100 MHz, or 128× when compared to a serial interface. As a result, an eFPGA with 100,000 lookup tables can be configured in under 2 ms.
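The 16× and 128× figures follow from the interface widths alone, as the sketch below shows. It assumes all three interfaces run at the same 100 MHz clock and move one word per cycle. The bitstream size is a placeholder that cancels out of the ratios, and the function names are illustrative.

```python
from fractions import Fraction


def bandwidth_bps(width_bits: int, clock_hz: int) -> int:
    """Raw bandwidth, assuming one word of width_bits is transferred per clock."""
    return width_bits * clock_hz


def config_time_s(bitstream_bits: int, width_bits: int, clock_hz: int) -> Fraction:
    """Exact time in seconds to push a bitstream through the interface."""
    return Fraction(bitstream_bits, bandwidth_bps(width_bits, clock_hz))


CLOCK_HZ = 100_000_000      # 100 MHz from the text, assumed here for every interface
BITSTREAM_BITS = 1_000_000  # placeholder size, it cancels out of the ratios below

t_serial = config_time_s(BITSTREAM_BITS, 1, CLOCK_HZ)
t_8bit = config_time_s(BITSTREAM_BITS, 8, CLOCK_HZ)
t_axi = config_time_s(BITSTREAM_BITS, 128, CLOCK_HZ)

vs_8bit = t_8bit / t_axi      # improvement over the 8-bit processor interface
vs_serial = t_serial / t_axi  # improvement over the serial interface

print(f"128-bit AXI vs 8-bit:  {vs_8bit}x")
print(f"128-bit AXI vs serial: {vs_serial}x")
```

Because an on-chip AXI interface would normally be clocked faster than 100 MHz, the real ratio scales up in proportion, which is consistent with the "better than" wording.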

The same abundance of on-chip interfaces makes it relatively easy to run multiple logical accelerators in parallel inside an eFPGA fabric, each with its own 128-bit wide AXI interface. In a relatively large eFPGA, with eight 128-bit AXI interfaces running at 1 GHz, you have a solution with the potential for an incredible 1-terabit-per-second data transfer.
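The 1-terabit figure is a peak number, and the arithmetic is easy to reproduce. This sketch assumes every interface transfers one full 128-bit word on every clock cycle, with no protocol overhead or stalls. The constant and function names are illustrative.

```python
AXI_WIDTH_BITS = 128          # width of each AXI interface, from the text
AXI_CLOCK_HZ = 1_000_000_000  # 1 GHz, from the text
NUM_INTERFACES = 8            # eight interfaces in a relatively large eFPGA


def aggregate_bps(interfaces: int, width_bits: int, clock_hz: int) -> int:
    """Peak aggregate bandwidth, assuming one full-width transfer per cycle each."""
    return interfaces * width_bits * clock_hz


total_bps = aggregate_bps(NUM_INTERFACES, AXI_WIDTH_BITS, AXI_CLOCK_HZ)
total_tbps = total_bps / 1e12                      # terabits per second
total_gbytes_per_s = total_bps // 8 // 1_000_000_000  # gigabytes per second

print(f"{total_bps:,} bit/s = {total_tbps} Tbit/s = {total_gbytes_per_s} GB/s")
```

Sustained throughput in a real design would be lower than this peak, depending on burst lengths, arbitration and how busy each accelerator keeps its interface.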

By using an eFPGA, companies have the potential to see:

  • >100× reduction in latency for real-world applications.
  • >10× improvement in throughput.
  • >2× reduction in power usage.
  • >4× reduction in area.

I’m sure you’ll agree these are all very impressive figures. If you’re interested in understanding more about what Speedcore IP can do for you, make sure to click on the video.
