How Dynamic Hardware Efficiently Solves The Neural Network Complexity Problem

As AI models continue to expand in complexity and size, tiny inefficiencies get multiplied into large ones.


Given the high computational requirements of neural network models, efficient execution is paramount. When operations are performed trillions of times per second, even the tiniest inefficiencies multiply into large ones at the chip and system level. And because AI models continue to expand in complexity and size as they are asked to become more human-like in their (artificial) intelligence, ensuring fast execution is only becoming more critical.

Adding to the difficulty, multi-layer inferencing models such as YOLO, ResNet, Inception, BERT, and other attention-based models often require very different types of processing (referred to as operators) for each of the many layers in a network. Each layer in a machine learning (ML) model performs an operation on a set of input data, and that operation may be replicated several hundred times with unique filters for each instance, as shown in Table 1. Table 1 lists the convolutional and residual layers of the DarkNet-53 backbone used by YOLOv3; the backbone extracts features from input images for use in object detection workloads. As the table makes clear, DarkNet makes heavy use of convolution.
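To make the scale concrete, here is a minimal NumPy sketch of a single convolutional layer of the kind that dominates a DarkNet-53-style backbone. The specific sizes (128 input channels, 256 filters, a 3x3 kernel, a 52x52 feature map) are illustrative assumptions, not values taken from Table 1.

```python
import numpy as np

# Illustrative only: one convolutional layer of a DarkNet-53-style backbone,
# expressed as plain NumPy shapes. The sizes below are assumptions, not values
# from Table 1.
in_channels, num_filters, k = 128, 256, 3
h, w = 52, 52

weights = np.random.randn(num_filters, in_channels, k, k)   # one unique filter per output channel
activations = np.random.randn(in_channels, h, w)            # input feature map for this layer

# Each filter is swept across the full feature map, so the same operation is
# replicated num_filters times, each time with its own weights.
macs = num_filters * in_channels * k * k * h * w
print("weights", weights.shape, "activations", activations.shape)
print(f"~{macs/1e9:.2f} billion multiply-accumulates in this single layer")
```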

Each layer of a model can be thought of as a subroutine containing billions of computations across hundreds of filters. A model then cascades dozens to hundreds of such layers, each performing convolutions of varying kernel sizes and filter counts, one after another. Other techniques, such as MaxPool or AvgPool, which reduce the pixels (or activations) of a layer, may also be applied to some layers and are critical to the accuracy of machine learning models. The challenge for any ML inference solution is to ensure that all of these different layer subroutines run as efficiently as possible on the accelerator hardware.
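As a small illustration of the pooling techniques mentioned above, the sketch below applies an assumed 2x2, stride-2 max pool and shows how it shrinks the activations that downstream layers must process.

```python
import numpy as np

# Minimal sketch of an assumed 2x2, stride-2 max-pool stage.
def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    c, h, w = x.shape
    # group each 2x2 window together, then take its maximum
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

acts = np.random.randn(256, 52, 52)
pooled = max_pool_2x2(acts)
print(acts.shape, "->", pooled.shape)   # (256, 52, 52) -> (256, 26, 26): 4x fewer activations
```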

In addition to the processing of tensors (multi-dimensional arrays of numbers), all current neural network models require significant movement of input data, activations, weights, and results. If not managed properly, this data movement can also drive inefficiency: no matter how efficient the processing elements are, if the tensor processors must spend hundreds of cycles waiting for data to arrive from memory, the system will inevitably lose efficiency.
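A rough back-of-the-envelope sketch of that data movement, assuming INT8 tensors and reusing the illustrative layer dimensions from the earlier sketch, shows how many bytes must flow for just one layer before any computation happens.

```python
# Rough sketch of the data one such layer moves, assuming INT8 tensors and the
# same illustrative layer dimensions as above. No reuse or caching is modeled.
in_c, out_c, k, h, w = 128, 256, 3, 52, 52
bytes_per_value = 1                                   # INT8 assumption

weight_bytes = out_c * in_c * k * k * bytes_per_value
input_act_bytes = in_c * h * w * bytes_per_value
output_act_bytes = out_c * h * w * bytes_per_value

total_mb = (weight_bytes + input_act_bytes + output_act_bytes) / 1e6
print(f"~{total_mb:.2f} MB of weights and activations for one layer")
```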


Table 1: DarkNet-53 backbone as used in YOLOv3

Dynamic tensor processing drives efficiency in model execution

The Flex Logix InferX X1 AI inference processor offers a unique approach to dealing with the computationally intense yet irregular complexity of inferencing models. We call this approach dynamic tensor processing. In dynamic tensor processing, the architecture of the tensor processing units (TPUs) themselves can be modified to optimize the TPUs' structure for the specific needs of any particular layer in a machine learning inference model. They can then be reconfigured, with virtually no overhead, to match the needs of the next layer of the model as execution proceeds.

The dynamic tensor processing approach of InferX allows the processing elements to be converted in just a few microseconds (usually overlapped with data transfer operations) to the optimal structure for the next layer of the network, while the activations are stored in and then read from local memory structures. This also reduces the bandwidth and capacity requirements of external memory in a high-performance system.
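The toy simulation below illustrates the general idea of this per-layer reconfiguration loop. The class and method names are invented for illustration and are not the InferX software interface; on real hardware the weight transfer and reconfiguration overlap, whereas this sketch simply logs the steps.

```python
# Hypothetical scheduling sketch -- not the actual InferX software stack.
from dataclasses import dataclass, field

@dataclass
class Layer:
    name: str
    operator: str          # e.g. "conv3x3", "conv1x1", "maxpool"
    tpu_config: str        # structure the tensor units should take for this layer

@dataclass
class SimulatedAccelerator:
    current_config: str = "idle"
    log: list = field(default_factory=list)

    def stage_weights(self, layer):
        self.log.append(f"DMA weights for {layer.name}")

    def reconfigure(self, layer):
        self.current_config = layer.tpu_config
        self.log.append(f"reconfigure TPUs -> {layer.tpu_config}")

    def execute(self, layer):
        self.log.append(f"run {layer.operator} ({layer.name}) on '{self.current_config}' config")

def run_model(layers, hw):
    for layer in layers:
        hw.stage_weights(layer)    # on real hardware this transfer and the
        hw.reconfigure(layer)      # reconfiguration proceed together
        hw.execute(layer)          # all filters of the layer run in parallel

hw = SimulatedAccelerator()
run_model([Layer("conv_1", "conv3x3", "wide-conv"),
           Layer("pool_1", "maxpool", "pooling"),
           Layer("conv_2", "conv1x1", "pointwise")], hw)
print("\n".join(hw.log))
```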

As an example, with the dynamic tensor processor architecture, the hundreds of filters in a convolutional layer can be processed in parallel by the tensor processors, yielding enormous performance and efficiency gains.

Other approaches to AI inference processing may pack up to ten times as many tensor processing elements into a design, which drives up both cost and power dissipation as the silicon required for processing grows commensurately. These massive tensor arrays also increase execution latency through the constant movement of data between memory and tensor processors. And while replicating tensor elements can lead to eye-popping TOPS numbers, the reality is that utilization of those tensor units is quite poor, dropping to single-digit percentages for low-latency, Batch=1 workloads.
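A quick back-of-the-envelope calculation shows how a large advertised peak can collapse to single-digit utilization at Batch=1. Every number below is an illustrative assumption, not a measurement of any specific accelerator.

```python
# Back-of-the-envelope utilization check with purely illustrative numbers.
peak_tops = 100                      # advertised peak throughput (INT8 TOPS)
model_gops = 66                      # operations per inference, roughly YOLOv3-scale
latency_ms = 15                      # assumed batch=1 latency

achieved_tops = (model_gops / 1e3) / (latency_ms / 1e3)   # tera-ops actually delivered per second
utilization = achieved_tops / peak_tops
print(f"achieved {achieved_tops:.1f} TOPS = {utilization:.0%} of the advertised peak")
```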

While results vary with workload, the dynamic tensor processor approach offers leading performance and execution efficiency as measured in inferences/watt and inferences/$.

Flexibility for evolving model requirements

The machine learning industry is advancing rapidly, with innovative new approaches and applications. A look at ML papers published on arXiv (https://arxiv.org/list/stat.ML/recent) shows thousands of new papers submitted every month! These papers cover new model designs, enhancements of existing designs, and new use cases for existing models. It's an exciting time. It's also a perilous time for chip developers, because committing to an ML accelerator architecture for today's important models could leave a design without a path forward if new and better approaches emerge that require different hardware configurations and optimizations. The winning solutions will offer not just software flexibility but hardware flexibility as well, the best way to ensure that a given edge inferencing technology will have relevance and longevity for many years.

The same dynamic tensor processor technology that provides efficient execution of current-generation ML models also provides a path to supporting new operators and techniques for ML models that haven't been developed yet. While organized around the foundational multiply-accumulate operation, the InferX technology can support both floating-point and integer data types of varying sizes. It can also reconfigure the connectivity paths among the tensor processing elements to support new dataflows. This approach leads to high confidence in the ability of the InferX technology to support new, yet-to-be-invented inference acceleration architectures for vision and other workloads.
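As a minimal illustration of that multiply-accumulate foundation, the sketch below performs the same dot-product accumulation in integer and floating-point form. The choice of INT8 inputs with a 32-bit accumulator (and FP16 inputs with an FP32 accumulator) is a common convention assumed here for illustration, not a statement about InferX internals.

```python
import numpy as np

# Minimal sketch of the multiply-accumulate (MAC) primitive in integer and
# floating-point flavors. Data-type choices are common conventions assumed
# here for illustration only.
a_int = np.random.randint(-128, 128, size=1024, dtype=np.int8)      # activations
w_int = np.random.randint(-128, 128, size=1024, dtype=np.int8)      # weights
acc_int = np.sum(a_int.astype(np.int32) * w_int.astype(np.int32))   # widen, multiply, accumulate

a_fp = np.random.randn(1024).astype(np.float16)
w_fp = np.random.randn(1024).astype(np.float16)
acc_fp = np.sum(a_fp.astype(np.float32) * w_fp.astype(np.float32))

print("int accumulator:", acc_int, " fp accumulator:", acc_fp)
```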

In summary, if you are looking for the best performance and efficiency for your edge inference application, and you also want a technology that is future-proof and can adapt to new techniques offering better accuracy or speed, then you should consider dynamic tensor technology.


