Nightmare Fuel: The Hazards Of ML Hardware Accelerators

Fixed-function accelerators embedded in silicon only stay useful if models don’t adopt new operators.


A major design challenge facing numerous silicon design teams in 2023 is building the right amount of machine learning (ML) inference capability into today’s silicon tape-outs in anticipation of what the state-of-the-art (SOTA) ML models will look like in 2026 and beyond, when that silicon will be shipping in volume-production devices. Given the continuing rapid rate of change in machine learning algorithms, making design choices for an uncertain future workload is perhaps one of the biggest headaches an SoC architect can face in today’s marketplace.

The vast majority of media coverage of the stunning evolution of ML focuses on the seemingly endless growth in model size and training data set size. Among the most cited references are the OpenAI charts showing the exponentially growing compute required to train SOTA models, with training runs often stretching to weeks across entire racks of training hardware. But whether a model takes a day or a month to train is of no consequence to the designer of a device that will only run inference on trained models.

The other focus of most media coverage is the resulting size of a model, expressed as the total number of model parameters, aka the “weights.” The total model size does have some bearing on the choice of system resources to put into an edge device or consumer product, mainly ensuring adequate off-chip storage (DRAM) in the system. But as long as tomorrow’s workload is functionally similar to today’s reference SOTA model, the SoC designer doesn’t care if that future model has double or quadruple the weights: the tradeoff between parameters and inferences per second – bigger models run slower on the same processing core – does not impact the choice of compute IP core to put in the silicon.

So, if the SoC architect doesn’t really care about the absolute size of the ML model, or how long the data scientist took to train it, what facet of ML model evolution does she care about? The answer: the building blocks – the machine learning operators – that make up the latest ML networks. If the operators don’t change between now and 2026, but only complexity and size change, then the architect can be certain that her choice of compute IP will continue to be fit for purpose. Why is that? The vast majority of inference acceleration engines (aka “NPUs”) in SoCs today are fixed-function accelerators paired with programmable CPUs, DSPs or GPUs. These NPUs are designed to offload the most commonly occurring and most compute-intensive graph operators from the programmable core, leaving the programmable core to run the rare or uncommon portions of the ML graph.
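
That offload decision can be pictured as a simple partitioning pass over the inference graph. The sketch below is a minimal illustration, assuming an ONNX export and the Python onnx package; the NPU_OPS set and the file name are hypothetical, and a real vendor toolchain also checks attributes, tensor shapes and data types rather than operator names alone.

```python
import onnx

# Hypothetical set of operator types the fixed-function NPU can execute.
NPU_OPS = {"Conv", "Relu", "MaxPool", "GlobalAveragePool", "Add", "Gemm", "Flatten"}

def partition(model_path):
    """Assign each graph node to the NPU or to CPU fallback, by operator type only."""
    model = onnx.load(model_path)
    npu, cpu = [], []
    for node in model.graph.node:
        (npu if node.op_type in NPU_OPS else cpu).append(node.op_type)
    total = len(npu) + len(cpu)
    print(f"{len(npu)}/{total} nodes offloaded to NPU")
    print("CPU fallback ops:", sorted(set(cpu)))

partition("resnet50.onnx")  # hypothetical file name for an exported model
```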

ResNet50 is obsolete

The greater the percentage of the graph running on the performance-efficient NPU, the better the system performance. In fact, several vendors of commercial NPUs are proud to proclaim in their marketing literature: “100% of ResNet50 runs on our NPU, freeing your CPU to focus on other tasks.” (Note the implication that if your eventual production ML network does NOT run 100% on the NPU, much of the workload falls upon the system CPU, revealing the Achilles’ heel of the hardwired NPU accelerator.)

Let’s examine what it means for an NPU to run “an entire ResNet50” graph. ResNet50 was introduced in an academic paper in 2015. By 2017 it had become something of a gold-standard performance benchmark for CNN-based image classifiers. It achieved that status both because of the accuracy the network could deliver and because it was amenable to acceleration on a variety of ML accelerator hardware in edge, endpoint and consumer device systems. The small number of distinct operator types in ResNet50 explains why it was relatively easy to accelerate.

The ResNet50 ONNX op list contains only eight basic ML operators (see the extraction sketch after the list):

  • Add
  • Conv
  • Flatten
  • Gemm
  • GlobalAveragePool
  • MaxPool
  • Relu
  • Softmax
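
Anyone with an exported model can reproduce this list directly from the graph. A minimal sketch, assuming the Python onnx package and a local ResNet50 export (the file name is hypothetical):

```python
import onnx
from collections import Counter

# Count the distinct ONNX operator types used in an exported model.
model = onnx.load("resnet50.onnx")  # hypothetical path to a ResNet50 export
op_counts = Counter(node.op_type for node in model.graph.node)
for op, count in sorted(op_counts.items()):
    print(f"{op:20s} x{count}")
```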

Further, ResNet50 employs only three straightforward variants of convolution: a 7×7 Conv with stride 2, a 3×3 Conv and a 1×1 Conv. Even SoC architectures that didn’t support the 7×7 convolution in the ML accelerator could still deliver decent performance, because only the first layer uses the 7×7 Conv and all the remaining 49 layers use the more common, easier-to-implement 3×3 and 1×1 convolutions. Furthermore, most implementations of ResNet did not attempt to accelerate the complex SoftMax final layer in the NPU accelerator, but rather passed that final step to the powerful host applications processor, which sorts and normalizes the detected values and presents the most likely classification.
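
These attribute-level variants matter as much as the operator names themselves, since a hardwired accelerator must match kernel sizes and strides exactly. A sketch under the same assumptions (Python onnx package, hypothetical file name) that enumerates the convolution variants a model actually uses:

```python
import onnx
from onnx import helper

# Collect the (kernel_shape, strides) combinations used by Conv nodes.
model = onnx.load("resnet50.onnx")  # hypothetical path
variants = set()
for node in model.graph.node:
    if node.op_type == "Conv":
        attrs = {a.name: helper.get_attribute_value(a) for a in node.attribute}
        kernel = tuple(attrs.get("kernel_shape", ()))
        stride = tuple(attrs.get("strides", (1, 1)))
        variants.add((kernel, stride))
for kernel, stride in sorted(variants):
    print(f"Conv kernel={kernel} stride={stride}")
```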

Many SoCs designed in the past several years used ResNet50 as the benchmark yardstick for their NPU accelerators. The architects of those SoCs could sleep well at night as long as the rapid evolution of CNNs didn’t stray too far from the types of operators (and their specific variants, like kernel size and stride) baked into the hardwired accelerators. For several years after the introduction of ResNet in 2015, that turned out to be true, and SoC designers slept well and dreamed idyllic, happy dreams.

Operator churn

Even during the years in which the ResNet family retained benchmark relevance, rapid change and churn in commonly used operators swirled in the background. The ONNX interchange format offers a good litmus test of the pace of operator change. Introduced in 2017, ONNX has progressed through 19 versions of officially supported operators in just six years, with today’s ONNX opset 19 containing 183 operator types, each with subvariants. With that much churn in the operator sets used by data scientists, it was only a matter of time before ResNet’s reign of SoC benchmark supremacy came to an end.
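
One way to see that churn directly is to check which opset a given export targets and what happens when it is re-targeted. A minimal sketch, assuming the Python onnx package and a hypothetical exported model file:

```python
import onnx
from onnx import version_converter

# Print the opset(s) an exported model was built against.
model = onnx.load("model.onnx")  # hypothetical exported model
for opset in model.opset_import:
    print(f"domain='{opset.domain or 'ai.onnx'}' version={opset.version}")

# Re-target the default domain to opset 19; this can fail when operators
# have no clean mapping between opset versions -- churn made visible.
upgraded = version_converter.convert_version(model, 19)
```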

Challenges of the newer ViT models

Constant churn of widely used ML operators in the training frameworks is nightmare fuel for SoC architects. The fixed-function – hence unchangeable – accelerators embedded in silicon only stay useful and relevant if SOTA models don’t adopt different, newer operators. The nightmare became real for many of those chip designers in 2021 with the introduction of the Vision Transformer (ViT) class of models. Offering superior results and ushering in a new wave of benchmark tests, ViT-class models use a dramatically different set of basic ML operators. A comparison of CLIP ViT-B/32 versus ResNet50 shows this change in stark detail.

Only five operator types are shared between the 2017 SOTA benchmark model and today’s 2023 SOTA benchmark model. An accelerator built to handle only the layers found in ResNet50 would run just 5 of the 24 operator types found in the ViT model – and that excludes the most performance-impactful operator, the 32×32, stride-32 patch-embedding convolution.
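
The gap is easy to quantify for any pair of exported models. A minimal sketch, assuming the Python onnx package and hypothetical file names for the two exports:

```python
import onnx

def op_types(path):
    """Return the set of operator types used by an exported ONNX model."""
    return {node.op_type for node in onnx.load(path).graph.node}

resnet_ops = op_types("resnet50.onnx")       # hypothetical export
vit_ops = op_types("clip_vit_b32.onnx")      # hypothetical export
print("Shared ops:", sorted(resnet_ops & vit_ops))
print("ViT-only ops (CPU fallback on a ResNet-era NPU):", sorted(vit_ops - resnet_ops))
```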

The bottom line: a hardwired accelerator optimized in 2017 for ResNet would be fundamentally broken – almost useless – when trying to run today’s SOTA ML model. History is bound to repeat itself. We should anticipate that 2027 will herald new models with new operators, leaving a hardwired accelerator optimized for today’s ViT models equally fated to early obsolescence. So what should an SoC architect do today to wrestle with this nightmare scenario?

The fully programmable GPNPU – performance plus programmability

Luckily for sleep-deprived chip architects around the world, Quadric’s recently introduced Chimera general-purpose neural processor (GPNPU) uniquely solves the performance versus flexibility tradeoff. Optimized for machine learning inference performance, and available in 1 TOPS, 4 TOPS and 16 TOPS configurations, Chimera GPNPUs deliver high, sustained multiply-accumulate (MAC) performance while maintaining full C++ programmability.

Any operator today – and any future operator – can be quickly and easily programmed by a software developer to run at speed on Chimera cores, utilizing all of the matrix-optimized MAC resources of the GPNPU. A Chimera GPNPU-powered device runs today’s ML models efficiently and will run tomorrow’s models just as efficiently. Learn more at www.quadric.io.


