Design teams are looking to new design and verification flows to meet the competitive time-to-market windows of edge AI.
While machine learning (ML) algorithms typically run on enterprise cloud systems to train neural networks, AI/ML chipsets for edge devices are growing at a triple-digit rate, according to the Tractica report “Deep Learning Chipsets” (Figure 1). Edge devices such as automobiles, drones, and mobile devices all employ AI/ML to provide valuable functionality.
Figure 1: Market data on AI/ML edge devices.
This market growth is driving an explosion of edge hardware architectures optimized for power, performance, and area (PPA). Hardware design teams are struggling to keep up as ML moves into the mainstream: they are often unable to optimize an AI/ML system in a single iteration, and the cost of long RTL design cycles sometimes forces them to abandon their initial attempts.
Moving AI/ML to the edge shifts the focus to custom designs that must meet the critical requirements of low power consumption and high performance. And if the traditional RTL design flow cannot accommodate the time-to-market windows of this highly competitive market, design teams must turn to a new design and verification flow.
Understanding the challenges
Optimizing an ML algorithm requires multiple passes through the design flow (Figure 2).
Some of these systems are too complex to analyze without building them first, which is not practical in hand-coded RTL design flows that can take three to six months to produce a working design. Coupled with ongoing revolutionary changes in both the algorithms and the hardware, this can lead teams to abandon their first attempts at hardware.
Figure 2: A machine learning design flow.
A key challenge in building custom hardware for ML inference engines is that power must be sacrificed for programmability. This is in large part due to the layer-by-layer behavior of convolutional neural networks (CNNs). The weight storage requirements of a CNN dramatically increase for the later layers, while the feature map storage requirements are largest for the early layers and decrease substantially for the later layers. Additionally, the precision required to implement the network accurately tends to decrease for the later layers.
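To make these trends concrete, the sketch below tabulates per-layer weight and feature map storage for a handful of convolution layers. The layer dimensions are made up (loosely VGG-like), purely for illustration, and are not taken from any design discussed here.

```cpp
// Rough per-layer storage estimate for an illustrative CNN.
// Layer dimensions below are hypothetical (loosely VGG-like).
#include <cstdio>

struct ConvLayer {
    int in_h, in_w, in_c;  // input feature map: height, width, channels
    int out_c;             // number of output channels (filters)
    int k;                 // square kernel size
};

int main() {
    ConvLayer layers[] = {
        {224, 224,   3,  64, 3},  // early layer: large feature maps, few weights
        {112, 112,  64, 128, 3},
        { 56,  56, 128, 256, 3},
        { 28,  28, 256, 512, 3},
        { 14,  14, 512, 512, 3},  // late layer: small feature maps, many weights
    };
    for (const ConvLayer &l : layers) {
        long weights  = (long)l.k * l.k * l.in_c * l.out_c;  // kernel parameters
        long features = (long)l.in_h * l.in_w * l.in_c;      // input activations
        std::printf("weights: %9ld   feature map: %9ld\n", weights, features);
    }
    return 0;
}
```

Running it shows weight storage growing from a few thousand parameters to millions, while feature map storage falls off substantially after the first few layers, which is exactly the imbalance described above.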
These competing storage and precision requirements for CNNs make a “one-size-fits-all” hardware implementation inefficient for power. General-purpose solutions can provide relatively high performance and small area, but they do so by “tiling” the ML algorithms and shuffling feature map data back and forth to system memory, which drastically increases power consumption. These general-purpose solutions also sacrifice full utilization of on-chip computational resources for programmability.
Potential architecture solution
More power-efficient approaches may require two or more hardware architectures, custom built to address the memory storage, computational, and precision requirements of the different layers in the network. These compute engines must work in tandem and will require complex on-chip memory architectures, massive parallelism, and access to high-bandwidth system memory.
For example, the early layers of the network could be mapped onto a fused-layer architecture or a multi-channel sliding window architecture. These architectures allow two or more of the first layers of the network to be computed without going off-chip to system memory. They also require relatively small amounts of on-chip storage since they only operate on a small “window” of the feature map data. For later layers, a multi-channel processing element (PE) array architecture is a good choice for both power and performance.
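As a rough illustration of the sliding-window idea, the simplified single-channel C++ below keeps only two image rows and a 3x3 window on chip; an actual HLS design would use bit-accurate types, handle many channels in parallel, and stream data through hardware interfaces. All sizes and names here are hypothetical.

```cpp
// Simplified single-channel 3x3 sliding-window convolution.
// Only (K-1) image rows plus a KxK window live on chip, so the full
// feature map never has to round-trip to system memory.
const int K = 3;    // kernel size
const int W = 224;  // feature-map width  (illustrative)
const int H = 224;  // feature-map height (illustrative)

// 'out' must hold at least H*W floats; only the valid region is written.
void conv2d_window(const float *in, const float kern[K][K], float *out) {
    static float line_buf[K - 1][W];  // on-chip line buffers
    static float window[K][K];        // current KxK window

    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float pixel = in[y * W + x];

            // Shift the window left and pull in a new column from the
            // line buffers plus the incoming pixel.
            for (int r = 0; r < K; ++r)
                for (int c = 0; c < K - 1; ++c)
                    window[r][c] = window[r][c + 1];
            for (int r = 0; r < K - 1; ++r)
                window[r][K - 1] = line_buf[r][x];
            window[K - 1][K - 1] = pixel;

            // Update the line buffers for the next row.
            for (int r = 0; r < K - 2; ++r)
                line_buf[r][x] = line_buf[r + 1][x];
            line_buf[K - 2][x] = pixel;

            // Multiply-accumulate once the window is fully valid.
            if (y >= K - 1 && x >= K - 1) {
                float acc = 0.0f;
                for (int r = 0; r < K; ++r)
                    for (int c = 0; c < K; ++c)
                        acc += window[r][c] * kern[r][c];
                out[(y - (K - 1)) * W + (x - (K - 1))] = acc;
            }
        }
    }
}
```

Because only the line buffers and the window are stored on chip, the feature map streams through once and never has to be written back to system memory between these early layers.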
A better design and verification flow
One of the big challenges of building custom hardware solutions is that designers must try multiple combinations of architectures and precisions to find the best tradeoff among power, performance, and area. Doing this in RTL is impractical, so designers are turning to High-Level Synthesis (HLS) to implement these custom solutions.
Catapult HLS gives hardware designers the ability to rapidly create and verify complex hardware architectures in C++/SystemC. HLS uses bit-accurate data types that allow the true hardware precision to be modelled in C++ simulation. This means that designers can not only model the bit-for-bit behavior of the ML hardware in C++/SystemC, but also verify ML designs in minutes instead of the hours or days required in RTL simulation.
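As a small example of what bit-accurate modelling looks like, the sketch below uses the Algorithmic C (AC) fixed-point types that ship with Catapult; the widths chosen here are purely illustrative, not a recommendation for any particular network.

```cpp
// Bit-accurate multiply-accumulate using Algorithmic C (AC) fixed-point types.
// The widths below are illustrative only.
#include <ac_fixed.h>

typedef ac_fixed<8, 2, true>   weight_t;   // 8-bit signed weight, 2 integer bits
typedef ac_fixed<10, 4, true>  feature_t;  // 10-bit signed feature-map value
typedef ac_fixed<24, 10, true> acc_t;      // wider accumulator to avoid overflow

// 3x3 kernel window multiply-accumulate; the C++ simulation behaves
// bit for bit like the hardware synthesized from it.
acc_t mac9(const feature_t x[9], const weight_t w[9]) {
    acc_t acc = 0;
    for (int i = 0; i < 9; ++i)
        acc += x[i] * w[i];
    return acc;
}
```

The same source can be exercised in an ordinary C++ test bench, so quantization effects are visible long before any RTL exists.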
Catapult HLS provides a design and verification flow that gives design teams an advantage when designing ML hardware, including:
As a final step in the architectural refinement process, the synthesizable C++, which has been designed with bit-accurate data types, can be plugged back into the ML framework (such as TensorFlow), allowing the algorithm designers to verify the implementation against the original algorithm.
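The mechanics of plugging the C++ back in are framework-specific. One possible approach, sketched below, is to wrap the bit-accurate model as a Python extension (here with pybind11) so a TensorFlow test bench can call it alongside the floating-point reference; the module name, function, and stand-in quantization are all hypothetical and are not necessarily part of the Catapult flow described above.

```cpp
// Hypothetical pybind11 wrapper exposing a bit-accurate C++ model to a
// Python/TensorFlow test bench. Names and the quantization stand-in are
// made up for illustration; the real kernel would come from the HLS source.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <vector>

// Stand-in for the synthesizable kernel: a dot product with inputs
// quantized to 8 bits, so fixed-point effects show up in the result.
static float layer_bitaccurate(const std::vector<float> &x,
                               const std::vector<float> &w) {
    float acc = 0.0f;
    for (size_t i = 0; i < x.size() && i < w.size(); ++i) {
        float xq = static_cast<int>(x[i] * 128.0f) / 128.0f;  // crude 8-bit quantization
        float wq = static_cast<int>(w[i] * 128.0f) / 128.0f;
        acc += xq * wq;
    }
    return acc;
}

PYBIND11_MODULE(hls_model, m) {
    m.def("layer", &layer_bitaccurate,
          "Run the bit-accurate model so Python can compare it with the float reference");
}
```

From Python, the test bench could then call hls_model.layer(x, w) and compare the result against the corresponding TensorFlow layer output.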
Moving machine learning to the edge imposes critical requirements on power and performance. Using off-the-shelf solutions is not practical: CPUs are too slow, GPUs/TPUs are expensive and consume too much power, and even generic machine learning accelerators can be overbuilt and are not optimal for power. Creating new power- and memory-efficient hardware architectures to meet next-generation requirements calls for an HLS design and verification flow in order to meet production schedules.
Learn more in our new whitepaper, Machine Learning at the Edge: Using HLS to Optimize Power and Performance.