Understand both the ML model and the target hardware to get the most out of inference accelerators.
The rise of machine learning has benefited greatly from acceleration technology such as GPUs, TPUs, and FPGAs. Indeed, without acceleration, machine learning would likely have remained in the province of academia rather than having the impact it has on our world today. Machine learning has become an important tool for solving challenging problems such as autonomous vehicle navigation, automated language translation, and industrial computer vision. The need for acceleration will only continue to grow, and it is important to recognize the value of designing machine learning models with acceleration in mind.
Once an organization has decided to deploy a machine learning model to production, whether in the cloud or at the edge, further decisions need to be made about the platform for the model. It is critical to ensure that the model runs optimally for inferencing, where its deployment can be both broad (running on millions of systems) and long (running for years). Here it becomes very important that the developer understands both the ML model and the target hardware it is designed to run on, so that model execution is as efficient as possible. The reality is that making the proper design decisions up front can lead to significant gains in performance, efficiency, and accuracy when using inference acceleration hardware.
Below are a few approaches used in common AI inference accelerators, along with guidance on designing models to take best advantage of them. Following these techniques will help you get the most performance and efficiency possible from your edge AI application.
The first and most widely known acceleration technique is "quantization." While a model is being trained and tested, the data used for activations and weights is typically in floating point format. Both CPUs and GPUs can process floating point data quite efficiently, but doing so carries a large cost in silicon area and power. As a result, dedicated accelerators usually use lower precision arithmetic where possible, which is far more efficient in both silicon and power.
Common data formats used today are 8-bit and 16-bit integers (INT8 and INT16) and a 16-bit floating point format (BF16). Computationally intensive operators that involve large matrix multiplications, which are typically found in Convolution, Fully Connected, and Transformer layers, benefit the most from this reduction in precision. However, there is a loss of accuracy associated with using INT8 arithmetic instead of floating point. Depending on the application, this loss of accuracy may be acceptable in exchange for higher throughput and lower power and cost of operation. Techniques that Flex Logix uses can keep the accuracy loss under 1% for most common models, but all model developers should be aware that inference processing will likely use lower precision data types and plan accordingly.
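To make the idea concrete, here is a minimal NumPy sketch of symmetric, per-tensor INT8 quantization. It is an illustration only, not any specific accelerator's scheme; production toolchains typically add calibration data, per-channel scales, and quantization-aware training to keep the accuracy loss small.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map float weights onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the INT8 values and the scale."""
    return q.astype(np.float32) * scale

# Quantize a random weight tensor and measure the error introduced.
w = np.random.randn(423, 94).astype(np.float32)
q, scale = quantize_int8(w)
error = np.max(np.abs(w - dequantize(q, scale)))
print(f"max absolute weight error: {error:.5f}")
```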
Heavy matrix computation in accelerators is usually performed in hardware blocks that have some level of architected parallelism. For example, a compute block might multiply a 32-element INT8 vector (usually activations) by a 32×32 matrix (usually weights) to produce a 32-element result vector (usually the next layer's activations, before the activation function). Larger matrix multiplications are then constructed from these smaller blocks.
For example, consider a 2D convolutional layer with 47 input channels in the input feature map, a 3×3 convolution window, and 94 output channels. The "filter size," or the number of elements in the input vector, is 3*3*47 = 423, and it must be multiplied by a 423×94 weight matrix. Mapping this multiplication onto 32×32 multiplier blocks requires padding 423 up to 14 blocks of 32 and 94 up to 3 blocks of 32, leaving roughly 8% of the multiplier capacity unused and lowering compute resource utilization. If the numbers of input and output channels were instead divisible by 32, utilization could be 100%.
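The arithmetic behind that utilization estimate is simple enough to sketch. The short Python function below assumes the hypothetical 32×32 block described above and zero-padding of the weight matrix to block boundaries; real accelerators may tile differently.

```python
import math

def tile_utilization(rows: int, cols: int, tile: int = 32) -> float:
    """Fraction of the multiplier array doing useful work when a rows x cols
    weight matrix is tiled onto tile x tile compute blocks (zero-padded)."""
    padded = math.ceil(rows / tile) * tile * math.ceil(cols / tile) * tile
    return (rows * cols) / padded

# 47 input channels, 3x3 window, 94 output channels -> 423 x 94 weight matrix
print(tile_utilization(3 * 3 * 47, 94))   # ~0.92: roughly 8% of the MACs idle
# Channel counts that are multiples of 32 (e.g. 32 in, 96 out) map perfectly
print(tile_utilization(3 * 3 * 32, 96))   # 1.0
```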
The problem becomes even more pronounced in Depth-Wise Convolutions, where the single large filter over both the window and the channels is split into two separate smaller filters: one over the window and another over the channels. When neither of those dimensions is a multiple of the hardware's inherent parallelism, the underutilization of hardware resources gets even worse, as the short sketch below suggests. Model developers should consider the inherent parallelism of accelerators and structure models accordingly for maximum efficiency of execution.
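Continuing the same hypothetical 32-wide example, a naive mapping of a depth-wise 3×3 filter gives each channel only a 9-element dot product to fill the vector lanes:

```python
# Minimal sketch under the same assumptions as above: a hypothetical 32-wide
# MAC vector and a naive mapping with no packing of multiple channels per vector.
VECTOR_WIDTH = 32
window = 3 * 3                  # a depth-wise 3x3 filter is a 9-element dot product
lane_utilization = window / VECTOR_WIDTH
print(f"depth-wise lane utilization: {lane_utilization:.0%}")   # 28%
```

Hardware that can pack several channels into one vector recovers some of this, but only if the channel count cooperates with the packing scheme.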
In the last section we described one condition for a high compute utilization rate: good mapping onto the compute resources. There is another important condition that affects utilization rate: support for highly parallel memory access.
Usually, an accelerator contains many basic compute blocks with a certain inherent parallelism, which in many cases must operate in parallel on a set of parallel data. In the case of 2D convolution, for example, these blocks execute small matrix multiplications on different parts of the same convolution filter, as well as on multiple filters (multiple convolution window computations). In both cases, data needs to be fed to the compute blocks in a highly parallel fashion. Usually this is done with many small SRAM blocks inside the accelerator, which together can provide enormous parallel bandwidth. The problem is how to route the data from these small SRAM blocks to the compute blocks. This is done with flexible multiplexors that connect the SRAMs to the compute blocks. These multiplexors, however, cannot route any arbitrary byte to any other byte, because the connectivity paths are limited to save silicon area. Data access patterns should therefore follow the patterns supported by the accelerator's multiplexors; otherwise the accelerator cannot supply enough data in parallel to the compute blocks, and the whole system slows down.

A great example is the "Shuffle" operator in ShuffleNet, which interleaves the channels of a feature map across groups before they are processed by the next layer (see the sketch below). This access pattern, while trivial to implement in software, is very difficult to implement efficiently in accelerator datapath hardware and will drive hardware underutilization when used. Models should avoid operators that have been shown to be difficult to implement in parallelized accelerator hardware.
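For reference, the channel shuffle used in ShuffleNet is typically implemented as a reshape–transpose–reshape, which produces a strided, interleaved channel ordering. The NumPy sketch below shows the resulting access pattern; it is this scattered gather across channels that fixed multiplexor routing struggles to feed at full rate.

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """ShuffleNet-style channel shuffle on an NCHW tensor: reshape the channels
    into (groups, C // groups), transpose, and flatten, which interleaves
    channels across groups."""
    n, c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(8).reshape(1, 8, 1, 1)      # channels labeled 0..7, 2 groups
print(channel_shuffle(x, 2).ravel())      # [0 4 1 5 2 6 3 7] -- a strided gather
```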
We strongly believe that, in addition to reducing weights and MAC operations and increasing accuracy, model developers should also pay attention to a model's inherent "accelerator friendliness."
ML accelerators can provide significant advantages in performance and efficiency when processing ML models. In the same way, accelerator-aware design of ML models helps ensure that the full capabilities of ML acceleration hardware are realized.
Developers of ML models should work with ML accelerator developers such as Flex Logix to ensure that their models will achieve the highest performance and efficiency without sacrificing accuracy.