A Bridge From Mars To Venus

The chasm between data scientist and embedded programmer.

In a now-famous 1992 pop psychology book titled “Men Are from Mars, Women Are from Venus,” author John Gray posited that most relationship troubles in couples stem from fundamental differences in socialization patterns between men and women. The analogy that the two partners came from different planets was used to describe how two people could perceive issues through completely different, and sometimes incompatible, worldviews. In today’s world of electronics, the headlong rush to build and deploy machine learning (ML)-based artificial intelligence into devices and systems creates its own Mars-versus-Venus clash of cultures.

The cloud world

ML models are created by data scientists, who are usually trained as mathematicians. Data scientists amass huge, labeled datasets in cloud storage systems and invent and train new models using the seemingly limitless compute resources offered by cloud service providers (CSPs). All of the leading ML training frameworks use floating-point representation during training for numerical accuracy, and the resulting breakthrough models carry millions or billions of Float32 parameters. For many, or indeed most, deployment (inference) scenarios, running the model in the cloud is feasible and practical. Analyzing satellite photos once per day or making a shopping recommendation to a consumer browsing a website can easily run on a cloud service with little regard for the energy consumed or the gigabytes of data transferred for a single model inference. But what if the intended runtime target is not the cloud, but a constrained embedded device?

The embedded world

Hundreds of product categories are seeing ML migrate from the cloud down to the consumer and industrial device level. Examples include real-time cameras that enhance safety and accuracy, voice-activated systems that speed up the human-machine interface, and smart, adaptable wireless networks. But these embedded systems have compute, storage, and power limitations dramatically different from the limitless resources of the CSP world. Embedded systems programmers live in this resource-constrained world. Squeezing a program’s code and data into limited memory is second nature to the experienced embedded developer. Writing lean C code to run bare metal or on a lightweight runtime is commonplace, while executing interpreted Python seems completely foreign.

Bridging the two worlds

Converting an ML model into an efficient implementation on an embedded device begins with converting the model itself from floating-point math to an integer, fixed-point implementation, because fixed-point arithmetic is 4x to 10x more energy efficient than floating point in most systems. Furthermore, to reduce model size (the total memory footprint of the weights), the conversion is almost always from 32-bit float to 8-bit integer, which further cuts the memory cost and the power consumed streaming the model into the chip for each inference. Beyond quantization to integer format, compressing a model can also involve introducing deliberate sparsity (zeroing out selected weights) and pruning away layers or channels that are less critical than others. But who should perform this conversion? Should it be the data scientist, who has ready access to the training environment and can measure the loss of model accuracy caused by downgrading the numerical precision from FP32 to INT8? Or should it be the embedded developer, who understands the embedded tooling and the target silicon and can more easily determine whether a model has been compressed enough to fit the target, but who lacks intimate knowledge of the model and the training data used to create it?
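To make the FP32-to-INT8 step concrete, here is a minimal sketch of symmetric, per-tensor post-training quantization in C++. The scale choice (max |w| / 127), round-to-nearest, and clamping shown here are illustrative assumptions, not any particular framework’s or vendor’s exact recipe.

```cpp
// Minimal sketch of symmetric per-tensor INT8 quantization of FP32 weights.
// The scale factor, rounding, and clamping are assumptions for illustration.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

struct QuantizedTensor {
    std::vector<int8_t> data;  // 8-bit weights (4x smaller than FP32)
    float scale;               // dequantize with: w ~= data[i] * scale
};

QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));

    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.data.reserve(weights.size());
    for (float w : weights) {
        int v = static_cast<int>(std::lround(w / q.scale));
        q.data.push_back(static_cast<int8_t>(std::clamp(v, -127, 127)));
    }
    return q;
}

int main() {
    std::vector<float> fp32_weights = {0.42f, -1.3f, 0.07f, 0.9f};
    QuantizedTensor q = quantize_int8(fp32_weights);
    for (size_t i = 0; i < q.data.size(); ++i)
        std::cout << fp32_weights[i] << " -> " << int(q.data[i])
                  << " (dequantized " << q.data[i] * q.scale << ")\n";
}
```

The gap between each original value and its dequantized counterpart is exactly the accuracy question raised above: the data scientist is best placed to measure its effect on the model, while the embedded developer is best placed to judge whether the smaller footprint actually fits the target.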

It would be challenging enough to bridge the gap between data scientists and embedded software engineers simply on the basis of their different skills, experiences, and knowledge domains. But to widen that gap further, consider that in most cases these engineers don’t work for the same company! Rare is the highly vertically integrated company that employs the data scientists, the system designers and system programmers, and the silicon designers all under one roof. Indeed, the data scientist is from Mars (Company A) and the embedded developer is from Venus (Company B).

Fixed-function NPUs exacerbate the gap

Today’s crop of ML-optimized silicon generally pairs programmable processors (CPU, DSP, or GPU) with a highly optimized, fixed-function hardware accelerator, often referred to as an offload engine or NPU (neural processing unit). These accelerators are often so specialized that only the embedded developer, or sometimes only the developer of the NPU itself, can perform the tortured optimizations needed to get an ML model running on the silicon, with part of the workload on the CPU and part on the accelerator. By the time the solution is debugged and running optimally, weeks or months of optimization may have passed, and the final ML network looks nothing like the beautifully crafted floating-point model that left the data scientist’s hands.
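A hedged sketch of the kind of operator-by-operator split that results is shown below. The set of “supported” operators and the dispatch check are hypothetical, invented for this illustration rather than taken from any specific NPU vendor’s API.

```cpp
// Hypothetical sketch of how an inference runtime partitions a network
// between a fixed-function NPU and the host CPU. The supported-operator
// list is an assumption for illustration only.
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Operators this imaginary NPU implements in hardware.
const std::set<std::string> kNpuSupportedOps = {"Conv", "Relu", "MaxPool", "Add"};

void run_layer(const std::string& op) {
    if (kNpuSupportedOps.count(op)) {
        std::cout << op << ": offloaded to NPU\n";
    } else {
        // Unsupported operators fall back to the CPU, costing a round trip
        // of activation data between the two compute domains.
        std::cout << op << ": fallback to CPU\n";
    }
}

int main() {
    // A toy network: the LayerNorm and Softmax have no NPU kernel in this
    // sketch, so the graph gets split across the two processors.
    std::vector<std::string> graph = {"Conv", "Relu", "LayerNorm", "Conv", "Softmax"};
    for (const auto& op : graph) run_layer(op);
}
```

Every fallback is a seam: a place where the embedded developer must manage data movement by hand, and where the data scientist’s original graph stops being recognizable.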

But all that hard work won’t be wasted, because machine learning models are, of course, slow to evolve and that hand-crafted solution can be used for years to come. Not. Even. Close. The sad reality is that the long, pothole-filled, one-way bridge between Mars and Venus lasts only a few weeks or months before a newer, more accurate ML model is invented by another data scientist, and the process has to start over. In the fast-changing world of deep learning, the newest model may look nothing like the old one, and that means the new model might not fit on the NPU accelerator at all. (See vision transformers for proof of this pitfall.) Embedded system-on-chip (SoC) developers need a better bridge between the data science and embedded worlds.

GPNPUs – A better bridge

Nothing can make Martians and Venusians completely compatible. But a general-purpose neural processor (GPNPU) from Quadric can build a stronger, sturdier, more understandable bridge. Chimera GPNPUs are fully programmable, unlike fixed-function NPU accelerators, and they run C++ code, including C++ compiled directly from neural net graphs. An ONNX graph compiled by Quadric’s Chimera Graph Compiler is captured in human-readable C++, so the data scientist can still recognize her model even after it has been optimized to run on an embedded device with a Chimera GPNPU. Optimizations made to the model at the C++ level can be understood at the functional and numerical level by both the Martian and the Venusian, enabling the two to keep collaborating on optimizing the solution.
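For intuition only, here is a hypothetical fragment of what a quantized layer expressed as readable C++ might look like. The function name, tensor type, and requantization scheme are invented for this sketch; this is not actual output of Quadric’s Chimera Graph Compiler.

```cpp
// Hypothetical illustration of "graph as readable C++": one quantized
// fully-connected layer whose math stays legible after embedded optimization.
// All names and the requantization scheme are invented for this sketch.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// A named layer from the original graph: out = requantize(W * x)
std::vector<int8_t> dense_0(const std::vector<int8_t>& x,
                            const std::vector<std::vector<int8_t>>& W,
                            float in_scale, float w_scale, float out_scale) {
    std::vector<int8_t> out;
    out.reserve(W.size());
    for (const auto& row : W) {
        int32_t acc = 0;  // 32-bit accumulator for the INT8 dot product
        for (size_t i = 0; i < x.size(); ++i)
            acc += int32_t(row[i]) * int32_t(x[i]);
        // Requantize: real value = acc * in_scale * w_scale, re-expressed
        // on the output scale and clamped back into INT8 range.
        float real = acc * in_scale * w_scale;
        int q = static_cast<int>(std::lround(real / out_scale));
        out.push_back(static_cast<int8_t>(std::clamp(q, -127, 127)));
    }
    return out;
}

int main() {
    std::vector<int8_t> x = {10, -20, 30};
    std::vector<std::vector<int8_t>> W = {{1, 2, 3}, {-3, 2, -1}};
    auto y = dense_0(x, W, /*in_scale=*/0.05f, /*w_scale=*/0.02f, /*out_scale=*/0.1f);
    for (int8_t v : y) std::cout << int(v) << "\n";
}
```

The point of the sketch is simply that a layer kept in this form remains inspectable by both sides: the data scientist can check the arithmetic against the original graph, and the embedded developer can tune it like any other C++ on the target.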

Quadric cannot solve your relationship problems. But we can help solve your machine learning inference design challenges. Find out more at www.quadric.io.


