Where The Rubber Hits The Road: Implementing Machine Learning On Silicon

Moving from high-power, high-performance to low-power, high-performance.


Machine learning (ML) is everywhere these days. The common thread between advanced driver-assistance systems (ADAS) vision applications in our cars and the voice (and now facial) recognition applications in our phones is that ML algorithms are doing the heavy lifting, or more accurately, the inferencing. In fact, neural networks (NN) can even be used in application spaces such as file compression [1].

In the world of ML, you often hear about the complexity and difficulty of training the models. Training is often done in the cloud via services like AWS or Google Cloud using deep learning frameworks such as TensorFlow or Caffe, but it can also be done on your own CPUs or GPUs, often farms of them. More recently, specialized products like Intel Nervana, NVIDIA Volta, and Google Cloud TPU have been introduced specifically to reduce training time. In any case, training is generally a high-performance, high-power application.

For many types of ML applications, inferencing via the trained model (i.e., leveraging the trained model to do the task at hand) may be able to remain in the cloud. For example, the cloud is an easy and inexpensive way to leverage proprietary ML algorithms for making financial predictions.

In other cases, such as automotive, security, mobile, or IoT applications, that type of implementation may not be feasible. For example, would you want the autonomous braking system in your car to involve round trips to the cloud? (And even that would assume the cloud is always available, no matter where you are driving.) Instead, the implementation must be local to the device, be it an automobile, security camera, watch or sensor.

That is where low power comes into play. Although the power profiles of those automobiles, security cameras, watches and sensors are quite different, each requires an order-of-magnitude reduction in power compared with the cloud or CPU/GPU farm implementations. And of course, a farm of GPUs doesn’t quite fit on your wristwatch.

Local inferencing options
For local inferencing, the trained ML model must be implemented in a much smaller and more energy-efficient way on the silicon (ASIC or FPGA) and software of the device. As shown in Figure 1, two typical implementation options involve using an embedded neural network-optimized digital signal processor (NN DSP) (e.g., Cadence Tensilica Vision C5 DSP) or creating a customized implementation of the trained model via high-level synthesis (HLS) (e.g., Cadence Stratus HLS).

Both implementation options are far more power-efficient than the CPU or GPU implementations, and both options are small enough to fit on custom silicon. The main tradeoff becomes flexibility versus power efficiency.

The NN DSP executes instructions, so changing the firmware allows the NN DSP to implement different customized NN topologies, even in the field via a firmware upgrade. By contrast, the custom hardware implements one specific ML topology (usually, but not always, a NN). While retraining or reinforcement training is typically possible on both implementations, only the NN DSP allows the NN topology to be fundamentally changed.

The custom hardware, however, is the more energy-efficient option, because it implements only the trained ML model. The datapath is fully customized for that specific model, and it doesn’t have to do, for example, instruction fetches. Both of these properties improve power efficiency at the cost of flexibility.

Why high-level synthesis for custom ML hardware?
Implementation of the trained ML model is usually done with HLS, although it theoretically could be done by hand-writing RTL. The advantage of HLS is that it makes the most of what customized hardware offers in terms of power, area and performance efficiency. Because HLS starts with a high-level model in SystemC or C++, that model can be easily modified to explore different implementation options.

As a trivial example, the bit widths of the computations can be fine-tuned to dial in the exact power, performance, area and accuracy (PPA-A) characteristics of the implementation. While the vanilla trained ML model often uses floating-point precision, which comes “for free” on a CPU or GPU, suitable accuracy can often be had with 8-bit or smaller fixed-point implementations at a fraction of the power and area.
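To make the bit-width trade-off concrete, here is a minimal Python sketch that rounds the weights and inputs of a single neuron onto a signed fixed-point grid and compares the result with the floating-point answer. The weight and input values, the function names, and the choice of one sign bit with the rest as fraction bits are all assumptions made for illustration. The sketch does not model the accumulator width or any particular HLS data type.

```python
# Illustrative only: quantize operands to signed fixed point and measure
# how far a dot product drifts from the floating-point reference.

def quantize(value, total_bits, frac_bits):
    """Round value to a signed fixed-point grid, saturating at the limits."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = round(value * scale)
    raw = max(lowest, min(highest, raw))  # saturate instead of wrapping
    return raw / scale

def dot(xs, ws):
    return sum(x * w for x, w in zip(xs, ws))

# Hypothetical weights and inputs for one neuron (not from any real model).
weights = [0.42, -0.17, 0.88, -0.63]
inputs = [0.5, -0.25, 0.75, 0.125]

reference = dot(inputs, weights)  # floating-point result

def quantized_dot(total_bits):
    frac_bits = total_bits - 1  # assumption: 1 sign bit, rest fraction
    qw = [quantize(w, total_bits, frac_bits) for w in weights]
    qx = [quantize(x, total_bits, frac_bits) for x in inputs]
    return dot(qx, qw)  # accumulation itself is left at full precision

errors = {bits: abs(quantized_dot(bits) - reference) for bits in (4, 8, 16)}
for bits, err in errors.items():
    print(f"{bits:2d}-bit fixed point: abs error = {err:.6f}")
```

For these particular values the error shrinks as the word length grows. Whether a given error is acceptable depends on the accuracy target of the end application, which is why the accuracy has to be measured on the full model rather than on a single neuron.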

Similarly, the architecture itself can be fine-tuned. Some connectivity can be localized, the activation functions can be changed and even the number of hidden layers can be modified to find the best PPA-A trade-off for the given end-application.

Implementing these changes wouldn’t be feasible in RTL, given the time required to write each RTL implementation. In SystemC, on the other hand, these changes can be trivial. Each model should be executed to determine its accuracy. If the accuracy is good enough, synthesize the model via HLS to get power, performance, and area metrics for that implementation. Then change the model and repeat the process. Eventually you end up with a range of implementations from which to choose the best PPA-A profile for the application.
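The explore-and-repeat flow described above can be sketched as a simple loop. In this Python sketch, `evaluate_accuracy` and `estimate_cost` are hypothetical placeholders with toy formulas. In a real flow the first would execute the SystemC or C++ model on validation data, and the second would come from an HLS run reporting power, performance and area. Neither function reflects the interface of any actual tool.

```python
# Illustrative design-space exploration loop with placeholder metrics.
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    bits: int           # fixed-point word length
    hidden_layers: int  # number of hidden layers in the model

def evaluate_accuracy(c):
    """Placeholder for running the model; toy formula, not real data."""
    return min(0.99, 0.80 + 0.015 * c.bits + 0.01 * c.hidden_layers)

def estimate_cost(c):
    """Placeholder for HLS power/area results; toy proxy, not real data."""
    return c.bits * c.bits * c.hidden_layers

def explore(candidates, min_accuracy):
    """Keep only candidates that are accurate enough, then attach a cost."""
    results = []
    for c in candidates:
        accuracy = evaluate_accuracy(c)
        if accuracy < min_accuracy:
            continue  # not accurate enough, so skip synthesis
        results.append((c, accuracy, estimate_cost(c)))
    return results

candidates = [Candidate(b, h) for b in (4, 8, 16) for h in (1, 2, 3)]
results = explore(candidates, min_accuracy=0.925)

# Pick the cheapest implementation among those that met the accuracy bar.
best = min(results, key=lambda r: r[2])
print("accepted:", len(results), "best:", best[0])
```

Checking accuracy before synthesis mirrors the order given in the text: candidates that miss the accuracy target never reach the more time-consuming synthesis step.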

[1] T. Chen et al., “BenchNN: On the broad potential application scope of hardware neural network accelerators,” 2012 IEEE International Symposium on Workload Characterization (IISWC), La Jolla, CA, 2012, pp. 36-45.
