Optimizing Power And Performance For Machine Learning At The Edge

Design teams are looking to new design and verification flows to meet the competitive time-to-market windows of edge AI.


While machine learning (ML) algorithms commonly run on enterprise cloud systems for training neural networks, AI/ML chipsets for edge devices are growing at a triple-digit rate, according to the Tractica report “Deep Learning Chipsets” (Figure 1). Edge devices, including automobiles, drones, and mobile devices, all employ AI/ML to provide valuable functionality.

Figure 1: Market data on AI/ML edge devices.

That type of market growth means that companies are seeing an explosion in edge hardware architectures optimized for power, performance, and area (PPA). Hardware design teams are struggling to keep up as ML moves into the mainstream. They often cannot optimize AI/ML systems in a single iteration, and the cost of long RTL design cycles sometimes forces them to abandon their initial attempts.

Moving AI/ML to the edge shifts the focus to custom designs that must meet the critical requirements of low power and high performance. And if the traditional RTL design flow cannot accommodate their time-to-market windows in this highly competitive market, design teams must turn to a new design and verification flow.

Understanding the challenges
Optimizing an ML algorithm requires multiple passes through the design flow (Figure 2):

  • Algorithm engineers work in machine learning frameworks, such as TensorFlow or Caffe, to design and validate an ML algorithm.
  • This often includes quantizing the algorithm from floating point to fixed-point as well as pruning to reduce complexity.
  • Hardware designers then implement the algorithm, which requires building hardware of sufficient complexity to meet PPA requirements.
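The quantization step in the flow above can be sketched in a few lines: convert float weights to saturated 8-bit fixed point using a scale picked from the tensor's dynamic range. The helper names and the int8 format are assumptions for illustration, not part of any particular framework:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Choose a scale so the largest-magnitude weight maps near the int8 limit.
float pick_scale(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs / 127.0f;
}

// Quantize one float value to signed 8-bit fixed point, saturating
// at the int8 range so outliers cannot wrap around.
int8_t quantize(float x, float scale) {
    float q = std::round(x / scale);
    q = std::max(-128.0f, std::min(127.0f, q));
    return static_cast<int8_t>(q);
}

// Recover an approximate float value from the quantized code.
float dequantize(int8_t q, float scale) {
    return q * scale;
}
```

Pruning would typically happen alongside this step, zeroing small weights before quantization so the hardware can skip them.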

Some of these systems are too complex to analyze without building them first, which is not practical in hand-coded RTL design flows that can take three to six months to produce a working design. Coupled with ongoing revolutionary changes in both the algorithms and the hardware, this can lead to teams abandoning their first attempts at hardware.

Figure 2: A machine learning design flow.

A key challenge in building custom hardware for ML inference engines is that power must be sacrificed for programmability. This is in large part due to the layer-by-layer behavior of the layers used to build convolutional neural networks (CNNs). The weight storage requirements for CNNs increase dramatically in the later layers, while the feature-map storage requirements are largest in the early layers and decrease substantially in the later layers. Additionally, the precision required to implement the network accurately tends to decrease in the later layers.
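This crossover between weight storage and feature-map storage is easy to see with a little arithmetic. The sketch below uses hypothetical, VGG-like layer dimensions (not taken from the article) and one byte per weight and activation:

```cpp
#include <cstddef>

// Toy model of per-layer storage for a CNN convolution layer.
// Dimensions and byte-per-value assumptions are illustrative only.
struct ConvLayer {
    std::size_t in_ch, out_ch, k, feat_h, feat_w;  // k x k kernel

    // Weight storage grows with channel depth (dominant in late layers).
    std::size_t weight_bytes() const { return in_ch * out_ch * k * k; }

    // Feature-map storage grows with spatial size (dominant in early layers).
    std::size_t feature_bytes() const { return out_ch * feat_h * feat_w; }
};
```

With these assumptions, an early layer (3→64 channels at 224×224) holds under 2 KB of weights but over 3 MB of activations, while a late layer (512→512 at 14×14) needs over 2 MB of weights and under 100 KB of activations, which is exactly the tension described above.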

These competing storage and precision requirements make a “one-size-fits-all” hardware implementation inefficient for power. General-purpose solutions can provide relatively high performance and small area, but do so by “tiling” the ML algorithms and shuffling feature-map data back and forth to system memory, which drastically increases power consumption. They also sacrifice full utilization of on-chip computational resources for programmability.

Potential architecture solution
More power-efficient approaches may require two or more hardware architectures that are custom built to address the memory storage, computational, and precision requirements of the different layers in the network. These compute engines will need to work in tandem and will require complex on-chip memory architectures, massive parallelism, and the ability to access high-bandwidth system memory.

For example, the early layers of the network could be mapped onto a fused-layer architecture or a multi-channel sliding window architecture. These architectures allow two or more of the first layers of the network to be computed without going off-chip to system memory. They also require relatively small amounts of on-chip storage since they only operate on a small “window” of the feature map data. For later layers, a multi-channel processing element (PE) array architecture is a good choice for both power and performance.
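A minimal model of the sliding-window idea is a convolution that only ever touches a 3×3 window of the input rather than the whole feature map; a real fused-layer engine would add line buffers and chain two or more such layers on chip. The function below is an illustrative sketch, not a Catapult construct:

```cpp
#include <cstddef>
#include <vector>

// Single-channel 3x3 convolution, valid padding. The inner loops read
// only a 3x3 window at a time, which is the property that lets a
// hardware implementation keep just a few line buffers on chip instead
// of the full feature map.
std::vector<int> conv3x3_sliding(const std::vector<int>& in,
                                 std::size_t w, std::size_t h,
                                 const int kernel[3][3]) {
    std::vector<int> out((w - 2) * (h - 2), 0);
    for (std::size_t r = 0; r + 2 < h; ++r) {       // window top row
        for (std::size_t c = 0; c + 2 < w; ++c) {   // window left column
            int acc = 0;
            for (int kr = 0; kr < 3; ++kr)
                for (int kc = 0; kc < 3; ++kc)
                    acc += kernel[kr][kc] * in[(r + kr) * w + (c + kc)];
            out[r * (w - 2) + c] = acc;
        }
    }
    return out;
}
```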

A better design and verification flow
One of the big challenges of building custom hardware solutions is that designers try multiple combinations of different architectures with different precision to find the best tradeoff between power, performance, and area. Doing this in RTL is impractical, so designers are turning to High-Level Synthesis (HLS) to implement these custom solutions.

Catapult HLS provides hardware designers with the ability to rapidly create and verify complex hardware architectures using C++/SystemC. HLS uses bit-accurate data types to model the true hardware precision in C++ simulation. This means that designers can not only model the bit-for-bit behavior of the ML hardware in C++/SystemC, but also verify ML designs in minutes instead of the hours or days required for RTL simulation.

Catapult HLS provides a design and verification flow that gives design teams an advantage when designing ML hardware, including:

  • Automatic memory partitioning: for creating complex on-chip memory architectures needed by the ML engine to achieve performance goals. These optimizations allow arrays in the C++ algorithm to be transformed into multiple memories operating in parallel.
  • Interface synthesis: allows arrays on the design interface to be automatically converted to high-performance AXI4 memory masters giving the core hardware transparent access to system memory, which is needed to fetch the millions of weights used by the ML algorithm.
  • Architectural code change and HLS optimization: can result in unique hardware with different PPA characteristics. Part of the architectural refinement step is being able to analyze and evaluate these tradeoffs interactively.
  • Power optimization: the tool can automatically optimize its RTL output for power and also report power for each step the designers take, allowing them to quickly determine the viability of their design.
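The memory partitioning in the first bullet can be pictured as a cyclic split of one logical array across several physical banks, so multiple words become accessible per cycle. This is only a conceptual model of the transformation the tool performs automatically; the bank count and helper names are assumptions:

```cpp
#include <array>
#include <cstddef>

// Cyclic (interleaved) partition of one logical array across NB banks.
// Consecutive addresses land in different banks, so up to NB words can
// be accessed in the same cycle by parallel hardware.
constexpr std::size_t NB = 4;      // number of physical banks (assumed)
constexpr std::size_t DEPTH = 64;  // words per bank (assumed)

std::array<std::array<int, DEPTH>, NB> banks{};

void banked_write(std::size_t addr, int value) {
    banks[addr % NB][addr / NB] = value;  // bank = addr mod NB
}

int banked_read(std::size_t addr) {
    return banks[addr % NB][addr / NB];
}
```

In Catapult this mapping is driven by directives rather than hand-written code; the point is only that one C++ array becomes several memories operating in parallel.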

As a final step in the architectural refinement process, the synthesizable C++, which has been designed using bit-accurate data types, can be plugged back into the ML framework (such as TensorFlow) allowing the algorithm designers to verify the implementation against the original algorithm.
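One way to picture this back-to-back check: run the same inputs through a float reference and the bit-accurate fixed-point model, and bound the error. The Q-format precision and function names below are illustrative assumptions, not the article's actual flow:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Floating-point reference, standing in for the original algorithm.
float dot_ref(const std::vector<float>& a, const std::vector<float>& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
    return acc;
}

// Bit-accurate model: values quantized to 8 fractional bits (assumed
// precision), accumulated in integer arithmetic as the hardware would.
float dot_fixed(const std::vector<float>& a, const std::vector<float>& b) {
    const float S = 256.0f;  // 2^8 scale
    int32_t acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += static_cast<int32_t>(std::lround(a[i] * S)) *
               static_cast<int32_t>(std::lround(b[i] * S));
    return acc / (S * S);
}
```

In practice the synthesizable C++ model is wrapped and called from the framework's test bench so the algorithm team sees exactly the bits the hardware will produce.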

Moving machine learning to the edge imposes critical requirements on power and performance. Using off-the-shelf solutions is not practical: CPUs are too slow, GPUs/TPUs are expensive and consume too much power, and even generic machine learning accelerators can be overbuilt and suboptimal for power. Creating new power- and memory-efficient hardware architectures to meet next-generation requirements calls for an HLS design and verification flow in order to successfully meet production schedules.

Learn more in our new whitepaper, Machine Learning at the Edge: Using HLS to Optimize Power and Performance.
