Integrating 16nm FPGA Into 28/22nm SoC Without Losing Speed Or Flexibility

Implementing programmable FIR filters by hardening the data path but keeping the control path in eFPGA.


Systems companies like FPGAs because they deliver parallel-processing performance that can outdo processors for many workloads, and because they can be reconfigured when standards, algorithms, protocols or customer requirements change.

But FPGAs are big, burn a lot of power and are expensive. Customers would like to integrate them into the adjacent SoC if possible.

Dozens of customers are now using eFPGA to integrate large FPGAs into their 16nm, 12nm and 7nm SoCs. More than 20 customers have working silicon. The largest eFPGA delivered so far is 240K LUTs, and the fastest runs at 500MHz across worst-case conditions from -40°C to +125°C.

Many chip companies are also using eFPGA to add flexibility for changing requirements.

If you have a 16nm SoC, getting 16nm FPGA performance is easy.

But how can a 28/22nm SoC integrate 16nm FPGA and keep the 16nm speed?

A simple example to illustrate a solution

Let’s suppose the FPGA is implementing a programmable FIR filter such as this simple example:

Fig. 1: Programmable FIR filter.

The data path is programmable to follow one of three paths: a) 20+40+60 = 120 taps, b) 20+60 = 80 taps, c) 60 taps.
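As a rough behavioral sketch (not the actual FPGA implementation), the path selection can be modeled in Python. The segment lengths (20/40/60 taps) come from figure 1; the function names, the coefficient-bank layout and the path-selection scheme are illustrative assumptions:

```python
# Behavioral sketch of the programmable FIR of Fig. 1. Segment lengths
# are from the article; everything else here is an assumption for
# illustration, not the hardware implementation.

def fir(samples, coeffs):
    """Plain FIR: y[n] = sum_k coeffs[k] * x[n-k]."""
    hist = [0.0] * len(coeffs)  # delay line
    out = []
    for x in samples:
        hist = [x] + hist[:-1]
        out.append(sum(c * h for c, h in zip(coeffs, hist)))
    return out

# Which tap segments each of the three selectable paths uses.
SEGMENTS = {"a": (20, 40, 60), "b": (20, 60), "c": (60,)}

def programmable_fir(samples, coeff_bank, path):
    """Concatenate the selected segments' coefficients, then filter."""
    taps = []
    for seg_len in SEGMENTS[path]:
        taps.extend(coeff_bank[seg_len])
    return fir(samples, taps)
```

Path "a" concatenates all three segments into a 120-tap filter; path "c" uses only the 60-tap segment.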

Frequency of FIR filters of increasing size in 28nm and 16nm FPGA

Digital filter algorithms are primarily composed of multipliers, adders, and registers. The basic structure of a finite impulse response (FIR) filter is shown in figure 2. There are two FIR filter architectures to choose from: direct form and transposed form. The transposed form was chosen because it best fits the DSP architecture of the embedded FPGA. In a transposed-form FIR, data samples are applied in parallel to all the tap multipliers through pipeline registers, and the products feed a cascaded chain of registered adders, which combine the roles of accumulators and registers. The tap coefficients were programmed into the FPGA fabric directly, rather than stored in registers, to minimize FPGA resources (i.e., registers and LUTs).

Fig. 2: N-TAP FIR filter transposed form architecture.
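The transposed-form structure described above can be sketched behaviorally in Python and checked against the direct form; the function names are illustrative, and both forms produce identical outputs:

```python
def fir_direct(samples, coeffs):
    """Direct form: each output is a dot product over a delay line."""
    hist = [0.0] * len(coeffs)
    out = []
    for x in samples:
        hist = [x] + hist[:-1]
        out.append(sum(c * h for c, h in zip(coeffs, hist)))
    return out

def fir_transposed(samples, coeffs):
    """Transposed form: the input feeds every tap multiplier at once,
    and products flow through a cascaded chain of registered adders."""
    acc = [0.0] * len(coeffs)  # the adder-chain registers
    out = []
    for x in samples:
        # Each register takes its tap product plus its neighbor's value.
        acc = [x * c + nxt for c, nxt in zip(coeffs, acc[1:] + [0.0])]
        out.append(acc[0])
    return out
```

The hardware benefit of the transposed form is that each adder sees only one product and one register, so there is no long combinational adder tree limiting clock frequency.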

FPGAs have large numbers of Multiplier/Accumulators (MACs) that can be used to implement FIR filters.

FPGAs have programmable interconnects to connect between logic blocks such as MACs and LUTs. In most FPGAs there are also direct hardwired connections between adjacent MACs for up to 5, 10 or even 20 MACs in a row. But as the FIR filter grows larger eventually the programmable interconnect needs to be used to connect the MAC chains.

Fig. 3: (a) Interconnect-connected MACs and (b) direct-connected MACs.

The table above shows the large performance difference between hardwired MAC connections (the 5-tap FIR filter) and programmable interconnect connections (the 21-tap and 40-tap FIR filters).

Notice especially that in 28nm, the hardwired FIR filter data path frequency significantly exceeds the frequency of the larger FIR filters in 16nm.
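The cost grows with filter size because more chain-to-chain hops must cross the slow programmable interconnect. A tiny helper, assuming ideal packing into hardwired chains (an illustrative model, not tool output), counts those crossings:

```python
import math

def interconnect_hops(n_taps, chain_len):
    """Number of programmable-interconnect crossings needed to link
    hardwired MAC chains of length chain_len into an n-tap FIR,
    assuming ideal packing."""
    return max(0, math.ceil(n_taps / chain_len) - 1)
```

With 5-MAC hardwired chains, a 5-tap filter needs no interconnect crossing, while 21- and 40-tap filters need 4 and 7, which is why their frequency drops.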

Harden the data path, keep the control path in eFPGA

In many FPGA implementations the data path is relatively constant; it is mostly the control path that is updated for changing algorithms, standards and customer needs. A large FPGA also tends to be limited by the longest connections across the design, so shifting to a smaller FPGA can boost performance.

So if a customer designing a 28nm SoC wants to integrate 16nm FPGA capability without losing frequency, a solution is to harden the data path, or most of it. For example, in the programmable FIR filter of figure 1, the data path could be hardened into chains of 10 or 20 MACs with selectable alternate data paths between the MAC chains.
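A behavioral sketch of that partitioning, using the 20/40/60-tap segments of figure 1; the control-bit names are hypothetical, and in hardware these bits would be driven by the eFPGA control path to steer bypass muxes between hardened chains:

```python
from dataclasses import dataclass

@dataclass
class PathConfig:
    """Control bits the eFPGA control path would drive.
    Field names are hypothetical."""
    use_20_tap_chain: bool
    use_40_tap_chain: bool

def active_taps(cfg):
    """The 60-tap chain is always in the hardened data path; the 20-
    and 40-tap chains are switched in by bypass muxes, giving the
    a/b/c paths of Fig. 1."""
    taps = 60
    if cfg.use_20_tap_chain:
        taps += 20
    if cfg.use_40_tap_chain:
        taps += 40
    return taps
```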

Then the control path can be implemented in 28nm eFPGA: our data shows that with carefully written 3- or 4-stage pipelined RTL, 400MHz performance is achievable for smaller blocks of eFPGA, which can then use their outputs to control the hardwired data path functions.

With EFLX eFPGA, another option is that reconfiguration doesn't have to be done from flash, as with existing FPGAs. If desired, configuration bits can be loaded from DRAM in milliseconds, or from local SRAM in microseconds, to reconfigure as quickly as needed.
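A back-of-the-envelope check of those load times; the 10Mbit bitstream size and the bus bandwidths below are illustrative assumptions, not EFLX specifications:

```python
# Rough check of the reconfiguration-time claim with assumed numbers.

def load_time_s(bitstream_bits, bandwidth_bytes_per_s):
    """Seconds to stream a configuration image at a given bandwidth."""
    return (bitstream_bits / 8) / bandwidth_bytes_per_s

dram_s = load_time_s(10e6, 1e9)    # ~1 GB/s DRAM path  -> ~1.25 ms
sram_s = load_time_s(10e6, 100e9)  # wide on-die SRAM   -> ~12.5 us
```

With those assumptions, a DRAM-sourced load lands in the millisecond range and an on-die SRAM load in the microsecond range, consistent with the claim.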

A real life example, built in silicon

Flex Logix has built an AI accelerator called InferX using InferX compute tiles, as shown below. InferX runs at 533MHz in TSMC 16nm for models such as YOLOv3/v4/v5 and ResNet-50.

The InferX compute tile hardwires 16 one-dimensional tensor processors (each with 64 INT8 MACs and a 64×64 INT8 weight matrix; INT16 mode is also supported), connected by a fast programmable interconnect ring that can be configured for various neural network operations. All of this is controlled by a central core of eFPGA, which implements the control logic for configuring and running the InferX compute tile to execute the desired neural network operator. Once an operator completes, the tile is reconfigured with the optimal connections between the 16 TPUs for the next operator.

Fig. 4: InferX compute tile.
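One of the one-dimensional tensor processors can be modeled, very roughly, as a 64×64 INT8 matrix-vector multiply with wide accumulators. The 32-bit accumulator width and the function name are assumptions; saturation and output scaling are omitted:

```python
# Behavioral model of one 1-D tensor processor as described: a 64x64
# INT8 weight matrix times a 64-entry INT8 vector, each of the 64 MACs
# accumulating into a wide register.

N = 64

def tpu_matvec(weights, x):
    """weights: N rows of N int8 values; x: N int8 values."""
    assert len(weights) == N and len(x) == N
    out = []
    for row in weights:
        acc = 0  # worst case |acc| <= 64*128*128 < 2**31, so 32 bits suffice
        for w, xi in zip(row, x):
            acc += w * xi
        out.append(acc)
    return out
```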


For finFET SoCs, fast FPGAs can be integrated without changing the FPGA Verilog RTL.

For 28nm SoCs, fast FPGAs can be integrated by hardening the data paths, with some programmable connections, and using eFPGA for the control path with carefully written 3- or 4-stage Verilog RTL. Flex Logix has done this for a complex design and can help our customers with theirs.
