To get top performance from neural network inferencing, you need lots of MACs/second.
FPGA chips are in use in many AI applications today, including in cloud datacenters.
Embedded FPGA (eFPGA) is now being used for AI applications as well. Our first public customer doing AI with EFLX eFPGA is Harvard University, who will present a paper at Hot Chips August 20th on Edge AI processing using EFLX: “A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices.”
We have other customers whose first question is, “how many GigaMACs/second can you execute per square millimeter?”
FPGAs are used today in AI because they have a lot of MACs. (See later in this blog for why MACs/second are important for AI).
The EFLX4K DSP core turns out to have as many DSP MACs per square millimeter (relative to LUTs) as other eFPGA and FPGA offerings, and generally more. But its MAC was designed for digital signal processing and is overkill for AI: AI doesn’t need a 22×22 multiplier, and it doesn’t need pre-adders or some of the other logic in the DSP MAC.
But we can do much better by optimizing eFPGA for AI: replace the signal-processing-oriented DSP MACs with AI-optimized 8×8 MACs with accumulators, configurable as 16×16, 16×8 or 8×16 as required, and allocate more of the eFPGA core’s area to MACs.
The result is the EFLX4K AI core, which has >10x the GigaMACs/second per square millimeter of any existing eFPGA or FPGA.
The EFLX4K AI core has 8-bit MACs (8×8 multipliers with accumulators) which can also be reconfigured as 16-bit MACs, 16×8 MACs or 8×16 MACs as required. Each core has 441 8-bit MACs which can run at ~1GHz at worst-case conditions (125°C Tj, 0.72V, slow-slow corner), for ~441 GMACs/second per EFLX core. This compares to 40 MACs at ~700MHz at worst-case conditions for the EFLX4K DSP core, or 28 GMACs/second.
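As a rough software model of what one of these units computes (mac8_t and mac8_step are hypothetical names for illustration; the actual behavior is defined by the EFLX4K AI hardware spec), an 8-bit MAC multiplies two 8-bit operands and adds the product into a wider accumulator:

```c
#include <stdint.h>

/* Illustrative 8x8 MAC with accumulator -- a software sketch, not the
   hardware definition. The accumulator is wider than the 16-bit
   product so that long dot products do not overflow. */
typedef struct {
    int32_t acc;
} mac8_t;

static inline void mac8_step(mac8_t *m, int8_t a, int8_t b)
{
    m->acc += (int32_t)a * (int32_t)b;   /* one multiply-accumulate */
}
```

At ~1GHz, 441 such units running in parallel is where the ~441 GMACs/second per core figure comes from.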
Why are MACs critical for AI?
Below is a very simple neural network graph. The input layer is what the neural network will process. For example, if the input layer were a 1024×768 picture, there would be 1024×768 = 786,432 inputs each with an R, G and B component! The output layer is the result of the neural network: perhaps the neural network is set up to recognize a dog versus a cat versus a car versus a truck. The hidden layers are the steps required to go from the raw input to achieve a high confidence output: typically there are many more layers than this.
A neural network approximates the neurons in a human brain, each receiving inputs from dozens or hundreds of other neurons to generate its own output. In the example above, the first hidden layer has 7 “neurons” receiving an input from each of the 5 inputs of the input layer. Shown in red above are the inputs received by the top neuron of the first hidden layer.
Mathematically, the inputs are multiplied by a unique weight (weights are computed in an earlier training phase), then summed, then “activated” to produce the value of the neuron.
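In code, one neuron’s computation is a short loop of MACs followed by the activation. Here is a minimal sketch, assuming float values and a ReLU activation (the data type and activation function are assumptions, not specified above):

```c
#include <stddef.h>

/* One neuron: weighted sum of the inputs, then an activation.
   Each loop iteration is exactly one multiply-accumulate (MAC). */
float neuron(const float *inputs, const float *weights, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += weights[i] * inputs[i];   /* multiply, then accumulate */
    return sum > 0.0f ? sum : 0.0f;      /* "activate": ReLU assumed */
}
```

Evaluating a whole layer applies this to every neuron, each with its own row of weights, which is exactly a matrix-vector multiply.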
What you can see is that multiply-accumulates, in the form of matrix multiplies, make up the bulk of the math required in computing a neural network.
In a practical neural network, the matrices have millions of entries and it is not practical to have a hardware matrix multiplier that big. Instead, the matrix multiply is broken down into blocks that “fit” in the hardware available.
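A minimal sketch of that blocking, assuming square n×n matrices and a hypothetical TILE constant sized to match the hardware MAC array:

```c
#include <stddef.h>

#define TILE 8   /* assumed block size, chosen to fit the MAC array */

/* Blocked matrix multiply: C += A * B, with C zeroed by the caller.
   Each TILE x TILE sub-problem is small enough to map onto a
   fixed-size hardware MAC array; the innermost statement is one MAC. */
void matmul_blocked(size_t n, const float A[n][n],
                    const float B[n][n], float C[n][n])
{
    for (size_t i0 = 0; i0 < n; i0 += TILE)
        for (size_t j0 = 0; j0 < n; j0 += TILE)
            for (size_t k0 = 0; k0 < n; k0 += TILE)
                for (size_t i = i0; i < i0 + TILE && i < n; i++)
                    for (size_t j = j0; j < j0 + TILE && j < n; j++)
                        for (size_t k = k0; k < k0 + TILE && k < n; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```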
So fast neural network processing comes down to having the biggest MAC array you can afford and running it at the highest frequency you can achieve.
The reason to have a reconfigurable MAC array is that the algorithms in neural networks are evolving rapidly, so a hard-wired solution is likely to become obsolete earlier than one which is reconfigurable.
The EFLX4K AI core is programmable with Verilog or VHDL today. In the future, TensorFlow and/or Caffe to eFPGA programming will be available as well.
Like all EFLX cores, the EFLX4K AI can be tiled into large arrays up to 7×7 (49 cores × ~441 GMACs/second ≈ 21.6 TeraMACs/second, i.e. >20 TeraMACs/second at worst-case operating conditions). EFLX4K AI cores can also be mixed in arrays with the other EFLX4K cores, Logic and DSP.
See www.flex-logix.com/eflx4k-ai to download a target spec for the EFLX4K AI core. The EFLX4K AI core can be implemented in any CMOS process in about 6-8 months.