Using FPGAs For AI

How good are standard FPGAs for AI purposes, and how different will dedicated FPGA-based devices be from them?


Artificial intelligence (AI) and machine learning (ML) are progressing at a rate that is outstripping Moore’s Law. In fact, they now are evolving faster than silicon can be designed.

The industry is looking at all possibilities to provide devices that have the necessary accuracy and performance, as well as a power budget that can be sustained. FPGAs are promising, but they also have some significant problems that must be overcome.

The graphics processing unit (GPU) made machine learning possible. It provided significantly more compute power and had a faster connection to memory than the CPU. Data centers rapidly incorporated GPUs into their offerings, and GPU vendors developed software to help their hardware be used effectively.

“The applicability of GPUs has been a success story in the development of ML,” says Mike Fitton, senior director for strategy and planning at Achronix. “They have been the engine driving initial developments and deployments. This is explained by a number of factors, including high floating-point performance and ease of development with a robust, high-level tools ecosystem.”

But GPUs are very power-hungry devices, and this is just as big an issue for datacenters as it is for edge devices.

AI algorithms have been growing in size and complexity, and GPU development has not been able to keep pace. “GPUs perform well on highly regular SIMD processing,” adds Fitton. “An alternative is the FPGA, which is inherently parallel and hardware-programmable, and these devices excel at specialized workloads that need massive parallelism in compute operations.”

FPGAs offer significantly more parallel computational elements, which can be arranged in configurations better suited to a given workload. They also have small amounts of distributed memory incorporated into the fabric, bringing memory closer to the processing. “FPGAs started to gain traction in that space because they can do more processing than the GPU,” says Joe Mallet, senior product marketing manager at Synopsys. “More importantly, they could do it in a quarter of the power budget. So for the first time in its life, the FPGA is now considered to be the low-power solution.”

The most important aspect of FPGAs is their flexibility. “This is manifested in two ways,” says Dave Lidrbauch, strategic marketing manager for Mentor, a Siemens Business. “First, they contain a lot of different kinds of resources, including hard cores, IP, memory, and the LUT structure of the programmable fabric lends itself well to a lot of neural network (NN) architectures. Second, the ability to field reprogram the entire chip, or parts of the chip, has gotten better over time.”

In addition, FPGAs are scalable. “If you have an algorithm that is bigger than what the GPU can handle or fits into the FPGA, you can daisy-chain FPGAs together, as well,” adds Synopsys’ Mallet. “So you can scale to very large algorithms or processing jobs. You could put one, four, or even eight or more FPGAs on a daughter card or per slot in a server and be able to scale that way.”

However, FPGAs have suffered because they do not provide the kind of natural programming model that software developers are used to. In addition, they do not really solve the power problem. “FPGAs are fairly power-hungry,” says Mentor’s Lidrbauch. “There are a lot of resources in each device, and depending upon how the algorithm is mapped across the device, many of them may not be used. The fine-grained power domains that you may find in an ASIC are not often available on an FPGA. So if I am doing an FPGA accelerator coupled to an Intel processor, I will be paying that cloud provider for the cost of the devices and the power being consumed. If I can reduce the amount of power, that can be an advantage for the costs that I am incurring.”

That appears to be the consensus across the industry. “The FPGA benefits in performance and flexibility for AI applications are well understood. The programming model has been the most significant challenge,” says Jordon Inkeles, vice president of product at Silexica. “The challenges in programming models won’t be solved by a single flow. It will be solved by a combination of flows that match the expertise of the developer. HLS abstracts away the hardware to enable software engineers to be successful, but that’s only until recently with tools like SLX FPGA from Silexica. Prior, most HLS users were still hardware engineers. HLS will provide the flexibility needed to take advantage of the FPGA for the software developer. However, TensorFlow and Caffe will still sit at a higher level of abstraction utilizing ML frameworks and libraries for data scientists to accelerate their algorithms.”

Neither of these devices was designed for AI, and outside of the datacenter, both solutions are too power hungry and expensive. “A year from now, nobody will be designing in GPUs or FPGAs to do any high-volume inference applications,” says Geoff Tate, CEO of Flex Logix. “Custom chips will deliver way more throughput per dollar. Customers either have a fixed dollar budget or fixed power budget, and they want to get more within those budgets. FPGAs and GPUs have been a good intermediate point, but they will not be the best solution a year from now.”

New approaches are necessary. “We saw FPGAs being used for machine learning, particularly in the early stages of AI when future hardware needs were very unclear,” says Dennis Laudick, vice president for Arm’s machine learning group. “As the technology and market for ML has stabilized, we have seen a demand for optimized ML solutions across the entire market and in all device classes. Along with this proliferation in interest, we have seen a proliferation of needs. For example, some only need infrequent or low demand ML, and costs are super critical to them. For others who have a GPU but can’t justify the additional silicon area of an NPU (neural processing unit), they can look to a GPU with improved ML support. Finally, where people require higher performance or where power efficiency is important, they can often turn to a dedicated NPU. However, even then they have a wide variety of needs, and this is why a range of devices are required for different markets.”

What needs to be configurable
Reconfigurability remains a key requirement in markets that are evolving. “There are many algorithms out there that are targeting audio, imaging, etc.,” says Mallet. “What you are basically looking at is a bunch of multiply accumulates (MACs) and memory tied together in various ways. When you look at ASIC or SoCs that are being developed specifically for those types of engines, you will see very similar structures. You will see MACs tied with interconnect, with a big chunk of memory at the basic level. There are variations in how many memories, what type of I/O, and how the interconnect is connected. At the very basic level, they all look like that. What is wrapped around it will change.”
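To make the "MACs plus memory" picture concrete, here is a minimal Python sketch of one fully connected layer expressed purely as multiply-accumulate steps over stored weights. The function names and the tiny example values are illustrative assumptions, not taken from any vendor's design.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: add the product a*b to the running sum."""
    return acc + a * b


def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer using nothing but MAC steps.

    weights[j][i] is the (hypothetical) stored weight from input i to output j.
    In hardware, each output row could be handled by its own MAC unit in parallel.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, row):
            acc = mac(acc, x, w)
        outputs.append(acc)
    return outputs


# Illustrative values only.
inputs = [1, 2, 3]
weights = [[1, 0, -1], [2, 2, 2]]
biases = [0, 1]
print(dense_layer(inputs, weights, biases))  # [-2, 13]
```

The structure of the loop never changes from network to network. Only the contents of the weight memory and the way the rows are wired together differ, which is the point Mallet makes about what is "wrapped around" the MACs.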

Most chip developers are not releasing many details about their designs, but Flex Logix is prepared to talk about theirs. “We have clusters of 64 MACs able to do INT8xINT8 or BF16xBF16 at half the rate of INT8,” explains Tate. “And reconfiguration is done in 2 microseconds. The way to get the maximum throughput per dollar or per watt is by hardening some stuff. Then you get better throughput, and although that means it will be somewhat less programmable, it will be more efficient.”

Fig. 1: Inside the Flex Logix AI chip. Source: Flex Logix

This is a very different take on an FPGA. “It is an FPGA designed to be an inference engine,” says Tate. “The lookup tables you see are for implementing state machines, which control the operation of the MACs during the execution of a given layer. The dataflow will be dominated by the data coming out of SRAM going through a programmable interconnect that is wired to take it into the MAC clusters, and to activation and lookup tables, and then back to SRAM. That is a dedicated dataflow path. You also need state machines to do things like increment the address registers for the input, increment them for the output, count how many cycles you have run and know when to stop. They are fairly simple, but fairly high-speed going to the lookup tables.”
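The control role Tate describes can be sketched in software. The following Python model is a simplified assumption of how such a controller might behave, not Flex Logix's actual design: it steps an input address counter and an output address counter, counts MAC cycles, and stops when the layer is complete. The names `run_layer` and `relu_lut`, the flat `sram` list, and the two-tap weights are all hypothetical.

```python
def relu_lut(x):
    """Stand-in for an activation lookup table (ReLU is chosen for illustration)."""
    return x if x > 0 else 0


def run_layer(sram, weights, in_base, out_base, n_outputs):
    """Model a simple layer controller: SRAM -> MACs -> activation -> SRAM.

    Returns the number of MAC cycles executed. Input and output regions of
    `sram` are assumed not to overlap.
    """
    in_addr = in_base      # input address register
    out_addr = out_base    # output address register
    cycles = 0             # cycle counter used to know when to stop
    produced = 0
    while produced < n_outputs:
        acc = 0
        for k, w in enumerate(weights):
            acc += sram[in_addr + k] * w   # one MAC per cycle
            cycles += 1
        sram[out_addr] = relu_lut(acc)     # write the activated result back
        in_addr += 1                       # increment input address
        out_addr += 1                      # increment output address
        produced += 1
    return cycles


# Illustrative memory image: four inputs followed by three output slots.
sram = [1, -2, 3, 4, 0, 0, 0]
cycles = run_layer(sram, weights=[1, 1], in_base=0, out_base=4, n_outputs=3)
print(sram[4:], cycles)  # [0, 1, 7] 6
```

The controller itself does very little arithmetic. It only sequences addresses and counts, which is why a handful of lookup tables can implement it while the hardened MAC clusters do the heavy lifting.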

That distributed memory is important. “It needs a distributed memory to hold intermediate results and weights, sum products, etc.,” says Nilam Ruparelia, segment leader for communications and artificial intelligence/machine learning at Microchip Technology. “AI is ideally structured and suited for an FPGA because it needs continuous parallel processing. The layout of an FPGA looks very similar, with compute blocks interspersed with memory. FPGAs can provide massively parallel operations capacity, MACs. The equations are more-or-less static. The weights and the inputs and output keep changing, so you don’t need processors. You just need compute capacity.”

Fig. 2: An example of a neural network. Source: Microchip Technology

This helps explain the continued robust growth in FPGAs. “With an FPGA structure we have a big advantage in that we can field upgrade,” adds Ruparelia. “We can load in entirely new NNs and you change the logic. With an ASIC you can change the equations in the NN but the underlying engine remains fixed. It is very hard to build an ASIC that can cover different networks at the same time. We see applications where you need to be operating multiple NNs concurrently.”

Details are important. “When targeting machine learning, subtly different architectural choices can have a dramatic impact on performance. For example, there is a significant amount of data locality that can be exploited in the fundamental matrix and vector math operations that comprise ML,” says Achronix’s Fitton. “Data locality is exhibited in shared weights, and re-used activation data between adjacent calculations. By tightly integrating the memories and processing elements, power is reduced because data movement is minimized. Performance is increased because routing resources are not expended to connect memory and MAC. This optimization approach has an additional benefit in adjacent applications that utilize matrix mathematics, such as beamforming and radar.”
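A rough way to see why data locality matters is to count memory reads for a one-dimensional convolution under two assumptions: no reuse (every MAC fetches its weight and activation from memory) versus full reuse (weights are loaded once into local registers and each activation is read once into a sliding window). This Python sketch is a deliberately simplified counting model with hypothetical sizes, not a measurement of any real device.

```python
def conv1d_fetch_counts(n_inputs, n_taps):
    """Count MACs and memory reads for a 1D convolution (no padding, stride 1).

    Returns (macs, reads_without_reuse, reads_with_reuse).
    """
    n_outputs = n_inputs - n_taps + 1
    macs = n_outputs * n_taps
    # No reuse: each MAC reads one weight and one activation from memory.
    reads_without_reuse = 2 * macs
    # Full reuse: each weight and each activation is read from memory once,
    # then held next to the MAC units.
    reads_with_reuse = n_taps + n_inputs
    return macs, reads_without_reuse, reads_with_reuse


# Hypothetical sizes chosen for illustration.
macs, naive, local = conv1d_fetch_counts(n_inputs=1024, n_taps=3)
print(macs, naive, local)  # 3066 6132 1027
```

Under these assumptions the compute is identical in both cases, but the memory traffic drops by roughly a factor of six. The gap widens as the number of taps grows, which is consistent with Fitton's point that minimizing data movement reduces power.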

What does performance mean?
A discussion about performance within AI processors will fill a future article, but the basic perception of performance as it used to exist for processors is beginning to change. “Not long ago, everyone talked about performance in terms of TOPS,” says Tate. “But having more TOPS does not correlate even loosely to having more throughput.”

Designing for benchmark performance can be problematic. “The software will continue to change rapidly, but you can’t design for something that doesn’t exist yet,” adds Tate. “You have to design for what you know is likely to be the most representative workload.”

Performance translates into utilization. “You want to use the MAC blocks as efficiently as possible,” says Jamie Freed, senior technical staff engineer and architect at Microchip Technology. “If you are using them every clock cycle, that would be ideal. That would be 100% efficiency. But that is not realistic, and it is usually in between 50% and 80% efficiency. Some networks have more layers than others, and some of those layers can be different. If you are doing lots of convolutions, then you are using a lot of MAC blocks and you will be using them very efficiently. If you have convolutions intermixed with different layers and non-linear functions in-between, which is what usually happens, then you are using more logic and not as many MAC blocks.”
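The relationship between peak TOPS and usable throughput can be worked through with simple arithmetic. The Python sketch below assumes the common convention that one MAC counts as two operations (a multiply and an add). The MAC count and clock rate are hypothetical values chosen for illustration, not figures for any specific device.

```python
def effective_tops(num_macs, clock_hz, utilization):
    """Estimate delivered tera-operations per second.

    Assumes each MAC unit performs 2 operations per clock cycle and that
    `utilization` is the fraction of cycles in which the MACs do useful work.
    """
    peak_ops_per_second = 2 * num_macs * clock_hz
    return peak_ops_per_second * utilization / 1e12


# Hypothetical device: 4,096 MACs running at 1 GHz.
for utilization in (1.0, 0.8, 0.5):
    tops = effective_tops(4096, 1e9, utilization)
    print(f"{utilization:.0%} utilization: {tops:.3f} TOPS")
```

With these assumed numbers, a device rated at about 8.2 peak TOPS would deliver between roughly 4.1 and 6.6 TOPS across the 50% to 80% utilization range Freed describes, so two chips with the same headline figure can differ substantially in real throughput.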

FPGAs retain an advantage over fixed processors. “They have an advantage when supporting new, emerging requirements, where the processing resources can be flexibly modified for new algorithmic requirements, modified data flow, and variable numeric precision,” says Fitton. “A key requirement is to optimize TOPS per watt for edge processing. This is achieved by joint optimization of processing within optimized but flexible multiplier macros, and fine-grained programmability for arbitrary width arithmetic. Although FPGAs can be targeted for general purpose acceleration, it is possible to tune their architecture to maximize performance, whilst minimizing cost and power.”

Fitton points to three areas where innovation is required:

  • High-performance interfacing: Getting large amounts of data onto the chip.
  • Efficient data movement: There is a need to move that data around the chip.
  • Efficient compute: Processing data with constrained cost and power.

Barriers for FPGAs
The biggest barrier to adoption of FPGA-based processors is the programming model. “Most of the people working in the cloud are software developers and they don’t really understand what a clock is,” says Mallet. “They know what an interrupt is. So how do you bridge that gap? You have to have an IP library that you program the FPGA with, and standard API calls for them to use. If you don’t start standardizing some, or adopting some of the existing standards, then it makes it more difficult for software folks to utilize. As they start to look at the standardization of that API, the ultimate goal is to make it easier for software developers to utilize different blocks as if it were just a library.”

Money is being invested in solving that problem. “One development area for FPGAs is the relatively lower-level of the software tools,” says Fitton. “FPGAs typically were programmed with low-level RTL (VHDL or Verilog) but this is changing with the emergence of high-level tools and frameworks. Examples of the investment can be seen with vendor specific frameworks, such as the Intel One API, or third-party tools, such as Mentor’s High Level Synthesis (HLS), or targeted frameworks for ML, such as those from Mipsology. The common objective with these initiatives is to harness the inherent flexibility of the reprogrammable hardware, whilst providing a more abstracted design entry methodology.”

AI adds a few extra wrinkles into HLS. “Ten years ago, HLS had a lot of very strict rules about what would be synthesizable and what would not,” says Lidrbauch. “How do you figure out which parts benefit the most from going into hardware and which parts should stay as software? Even more complex, how do I figure out how those two should communicate? Should they share memory, should they have a data bus or a serial bus that talks directly between them? That has been a real problem for the software folks. There will be a lot of development in this area that will really advance AI. Hardware architectures are important to enabling flexibility, but the next level of sophistication in HLS or in the tools like Silexica’s that can recognize the intent of parts of the C code — those are going to be the enabling elements for AI.”



InAccel says:

Great article!
It is true that FPGA’s main barrier for wide adoption is the programming complexity.

Konstantin says:

FPGAs cannot achieve performance comparable to GPUs/TPUs, for inherent reasons. Their clock frequencies are several times lower, and their flexibility also comes at a cost.

FPGAs have their place only in very narrow domains
– Low-precision inference (int8 or binarized)
– At-the-edge application, where DNN engine shares FPGA with camera/RF interface
– Low-latency applications, where single-request inference time is more critical than total throughput

Michael Mingliang Liu says:

Informative article. Good job.

Programming FPGAs and running synthesis for them have been widely deemed a major challenge for years… A fond reminder of Altera’s OpenCL Compiler (now Intel’s)…

Job security of embedded engineers well versed in Verilog!
