Why TinyML Is Such A Big Deal

Surprisingly, not everything requires lots of compute power to make important decisions.


While machine-learning (ML) development activity most visibly focuses on high-power solutions in the cloud or medium-powered solutions at the edge, there is another collection of activity aimed at implementing machine learning on severely resource-constrained systems.

Known as TinyML, it’s both a concept and an organization — and it has acquired significant momentum over the last year or two.

“TinyML deployments are powering a huge growth in ML deployment, greatly accelerating the use of ML in all manner of devices and making those devices better, smarter, and more responsive to human interaction,” said Steve Roddy, vice president of product marketing for Arm’s Machine Learning Group.

Only so many applications can be realized on systems with little memory and limited computing power, some of which will be battery-powered. But for those applications, there is a dedicated version of TensorFlow Lite and an ongoing stream of new ideas for implementing complex logic with a light footprint.

Based on the incredible attention being paid to complex edge inference platforms and raw power in the data center, it may seem like folly to try to do that same work in a meager platform. And yet a growing number of people are doing just that. So is TinyML a new hardware platform? A software tool? A completely new AI methodology? It’s none of the above, and some of each.

TinyML appears to be two things. Informally, it’s about getting machine learning to work on resource-constrained systems so small that they may not even be running a full operating system (OS). “TinyML is just a name for doing machine learning on constrained devices,” said Allen Watson, product marketing manager at Synopsys.

Formally, it’s a community under the aegis of the TinyML Foundation, started by individuals from Google and Qualcomm, that sponsors frequent online get-togethers and an annual conference to share new ideas and ways of solving some of these tough problems.

“The organization, TinyML.org, is really just a place where people who are interested in those types of applications can come and share ideas,” said Jamie Campbell, software engineering manager at Synopsys.

Focus on light applications
The driver behind TinyML is the notion that some problems don’t need a data center server, or even a fancy edge platform, to solve. That said, it’s not that they’ve discovered how to do everything on an Arduino board. Instead, it’s about finding problems that can be solved that way, improving the solutions, and pushing back the boundaries for which higher-powered computing is truly required.

“ML carried out solely in the cloud can be expensive in terms of device battery power, network bandwidth, and time to transmit data to the data center,” noted Arm’s Roddy. “All of these costs can limit how widespread the adoption of ML becomes. TinyML enables ML to be done globally, as the ML inference can be carried out on the device where the data is generated.”

In many cases, a frugal partial solution may be used in a tiered approach, where a small, always-on piece of functionality determines when to pay attention so that higher-powered hardware and software can kick in. That higher-powered edge solution can, in turn, serve as a tier that sends the biggest problems to the cloud for resolution.

A good example of that is always-on sound detection, which can be used to detect whether a voice has been heard. That’s a small enough problem to do without much computing. If a small engine determines it was a voice, then it can wake up more circuits to see if what was said was a “wake word.” If that also turns out to be the case, then the entire uttered phrase can be sent to the cloud for parsing.
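
The tiered flow in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the energy threshold and the `looks_like_wake_word` and `send_to_cloud` callbacks are hypothetical placeholders, not part of any product or library mentioned in this article. A real first tier would more likely use a small trained model than a fixed threshold.

```python
from typing import Callable, List, Sequence

# Hypothetical threshold on mean squared amplitude; a real always-on tier
# would more likely use a small trained model than a fixed number.
VOICE_ENERGY_THRESHOLD = 0.01


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (samples in -1.0..1.0)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def tier1_sound_detected(frame: Sequence[float]) -> bool:
    """Always-on tier: cheap check that something loud enough was heard."""
    return frame_energy(frame) > VOICE_ENERGY_THRESHOLD


def process_frame(
    frame: Sequence[float],
    looks_like_wake_word: Callable[[Sequence[float]], bool],
    send_to_cloud: Callable[[Sequence[float]], None],
) -> str:
    """Run the tiers in order and report how far the frame got."""
    if not tier1_sound_detected(frame):
        return "idle"            # nothing else is powered up
    if not looks_like_wake_word(frame):
        return "woke-tier-2"     # second tier ran, but found no wake word
    send_to_cloud(frame)         # only now is any data transmitted
    return "sent-to-cloud"


if __name__ == "__main__":
    uploaded: List[Sequence[float]] = []
    quiet = [0.001] * 160
    loud = [0.5, -0.5] * 80
    print(process_frame(quiet, lambda f: True, uploaded.append))
    print(process_frame(loud, lambda f: False, uploaded.append))
    print(process_frame(loud, lambda f: True, uploaded.append))
    print(len(uploaded), "frame(s) uploaded")
```

The design point is that each tier is only reached if the cheaper one before it says so, which keeps both power and transmitted data low when nothing interesting is happening.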

“Imagine the Internet bandwidth that you would consume by sending a constant stream of audio data up to the cloud,” said Synopsys’ Campbell. “They get around that problem by saying, ‘Let’s do the simple processing locally on the device and save a bunch of bandwidth.’ TinyML is usually applicable in that very first step, which is the always-on detection.”

On the other hand, what was heard might instead be the sound of breaking glass, which would indicate a security problem in a home. In that case, an alert immediately could be sent without further intervention from the cloud (even if the alert is actually communicated through the cloud).

This is even more evident with images, and particularly streaming video, where the amount of data can quickly overwhelm an edge device. “It doesn’t make sense to send everything to the cloud,” said Cheng Wang, senior vice president of software and engineering at Flex Logix. “There are too many cameras and too much data to process everything efficiently. The latency requirements are very stringent and the bandwidth requirements are too high. You have to have something closer to the device, where you’re not doing anything on a cellular or wireless network. That’s the first batch of processing. Once the first batch of intelligence has been established, whatever comes out of that is only going to give you interesting results every once in a while. Bounding objects in a frame is a much, much lower data footprint than the actual raw video. It’s much easier for the machines that come after that to digest the processed data than the raw data.”
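
The point about data footprint can be made concrete with some back-of-the-envelope arithmetic. The resolution, frame rate, and bounding-box encoding below are assumptions chosen purely for illustration, not figures from Flex Logix. Real systems would typically compress the video, which narrows the gap, but the direction of the comparison stays the same.

```python
# All figures are illustrative assumptions, not measurements.
WIDTH, HEIGHT = 640, 480        # assumed VGA camera
BYTES_PER_PIXEL = 3             # assumed uncompressed 8-bit RGB
FPS = 30                        # assumed frame rate

BOXES_PER_FRAME = 10            # assumed number of detected objects
BYTES_PER_BOX = 4 * 2 + 1 + 1   # x, y, w, h as 16-bit ints + class id + score


def raw_video_bytes_per_second() -> int:
    """Data rate if every raw frame is sent upstream."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS


def box_bytes_per_second() -> int:
    """Data rate if only the bounding boxes are sent upstream."""
    return BOXES_PER_FRAME * BYTES_PER_BOX * FPS


if __name__ == "__main__":
    raw = raw_video_bytes_per_second()
    boxes = box_bytes_per_second()
    print(f"raw video:      {raw:,} bytes/s")
    print(f"bounding boxes: {boxes:,} bytes/s")
    print(f"reduction:      {raw // boxes:,}x")
```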

TinyML is one of several approaches being developed to solve this problem.

Motivations for TinyML
Evgeni Gousev, chairman of the board of TinyML Foundation and senior director of engineering at Qualcomm, was one of the founders of TinyML. “Cloud-based ML has three fundamental problems,” he said. “One is energy efficiency. The second is latency. The third one, which is becoming even more important, is privacy, because everyone is concerned about Big Brother.”

TinyML also was motivated by the fact that much of AI design happens in specialized silos, each working in isolation.

“This sparked the idea that we should bring together more hardware designers, algorithmic designers, and application designers, [because] there are too many isolated fields here,” said Marian Verhelst, director at the TinyML Foundation.

Everything has to be co-optimized together in order to stuff AI into tiny systems, and so this organization breaks down the silos and gives all of the players an opportunity to work together.

But given how hard so many people are working on high-powered solutions, how can anyone expect a small embedded system to handle this kind of problem? Even though engineers asked that question in the beginning, they are finding you can do more than you expect with very little.

Target platforms
Data center AI solutions rely on massive collections of servers. Each of those servers may leverage high-end microprocessors, but more likely it’s calling on dedicated ML accelerators. Such systems lie at the opposite end of the spectrum that TinyML is addressing.

But even the edge solutions that are hitting the market tend to involve carefully crafted hardware architectures that typically operate as accelerators to a host processor. That host processor may share silicon with the accelerator hardware, or the accelerator may assume that the host is on another chip in the system.

The goal for such systems is to provide plenty of optimized computing and efficient access to lots of memory, where large ML models can be stored for retrieval during processing. This requires complex software and an OS that can provide access to various files and dynamic memory as needed.

But that’s still far more resources than a system for TinyML is likely to have. The declared intent at present is to target microcontroller-based systems. MCUs are the only processing elements available on such small systems. “The goal is to reduce the size of machine learning until it fits on the microcontroller,” said Yipeng Liu, product marketing group director for Tensilica audio/voice IP at Cadence.

In such a system, memory will be restricted and computing will be limited, with clock frequencies operating well below the gigahertz range. Importantly, there may not be a full OS. That severely limits what can be done — or it means replacing the OS services with code explicitly worked into the application software.

In typical ML processing, “the model that’s generated is a file,” said Liu. “If you’re using an RTOS, you will not have a file system. Without a file system, it’s hard for an embedded device to read the file. So you have to convert this file into something that you can compile into your code on the MCU.”

As accelerators evolve to serve these small devices, they may take over much of the heavy lifting from the MCU. But such accelerators would need to be in the low-power range.

“For the first time, we’ve started to make custom AI chips within the milliwatt or even lower power budgets, really optimized for TinyML,” noted TinyML Foundation’s Verhelst.

And that’s just the beginning. “Future chips over the next five years are expected to have vastly superior ML capabilities, with the same power consumption and with dedicated accelerators that will open up vision-based applications, as well,” said Sree Harsha, product marketing manager for IoT, compute, and security at Infineon.

An ML engine for a resource-constrained system
Today, these systems tend to be plain-vanilla platforms like Arduino and Raspberry Pi. That means that the solutions generally cannot rely on specialized hardware to help with the problem. It becomes more of a software problem.

Key to the processing of most AI models is an engine like TensorFlow. TensorFlow, and other tools like it, operate in the data center with access to all of the resources necessary to solve big problems as quickly as possible. For inference, TensorFlow interprets a model, keeping the level of abstraction high. That makes it easier to investigate different models and implement the chosen ones quickly, with less software development.

For edge platforms, TensorFlow has been adapted into TensorFlow Lite. This engine can work on systems at the edge, which have access to fewer resources than the data center. Exactly what the “edge” means can vary. For some, it means the “edge of the core network,” which typically would be an on-premise server that acts as a gateway to the internet. Such a system is likely to be well equipped, while not providing the scale-out capabilities of the cloud.

But the true edge, meaning the end nodes in the network, is not likely to have access to those resources. That may restrict its ability to run TensorFlow Lite.

While some developers opt for compiling their solutions to native code in a process sometimes referred to as “graph lowering,” many still opt for the interpreted approach. Interpreters are less efficient than native code, but they provide convenience, abstraction, and flexibility. Just as in the data center, they simplify the software development task at the expense of some performance.

TensorFlow Lite still assumes, at the very least, an operating system for retrieving files containing the abstracted models, as well as memory allocation for use in storing models, parameters, partial calculations, or anything else that may need to reside in-system for the duration of the computation.

An even smaller TensorFlow Lite
The systems TinyML targets have none of that, and so TensorFlow Lite has been modified yet again into TensorFlow Lite for Microcontrollers. While this is still an interpreter, it no longer relies on a file system and dynamic memory allocation. It provides abstraction one level above native code.

Instead of placing the model in a separate file, the model is compiled into a C byte array that can be stored in the read-only portion of the program. That gives the program access to it while not requiring the program to read an external file.
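
This conversion is commonly done with a tool such as `xxd -i`. As a rough sketch of the same idea, the Python below turns the bytes of a model file into C source text. The function name `bytes_to_c_array` and the array name `g_model` are hypothetical, and the sample bytes merely stand in for a real model file.

```python
def bytes_to_c_array(data: bytes, name: str = "g_model", per_line: int = 12) -> str:
    """Render raw bytes as a C source snippet defining a const byte array.

    Declaring the array `const` lets a typical embedded toolchain place it
    in read-only memory (for example, flash) rather than RAM.
    """
    lines = [f"const unsigned char {name}[] = {{"]
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {name}_len = {len(data)};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in bytes; in practice these would be read from the model file,
    # e.g. open(path_to_model, "rb").read() with a path of your choosing.
    fake_model = bytes([0x1C, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4C, 0x33])
    print(bytes_to_c_array(fake_model))
```

The generated text is compiled and linked with the rest of the application, so the model travels inside the program image instead of in a file.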

If memory management is needed, then rather than relying on the OS, manual techniques are required to allocate and use memory. Any C or C++ libraries are statically linked at compile time, making it unnecessary to keep dynamic libraries around.

The result is a small interpreter that takes up less than 16 KB of memory on an Arm Cortex-M3 system.
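
One common manual memory technique is to reserve a single fixed-size buffer up front and hand out slices of it, so nothing is ever requested from an OS at run time. The Python below is a toy model of that idea, not the TensorFlow Lite for Microcontrollers implementation. The class name `Arena`, the 16-byte alignment, and the buffer sizes are all assumptions made for illustration.

```python
class Arena:
    """Toy bump allocator over one fixed buffer reserved at startup."""

    def __init__(self, size: int, alignment: int = 16) -> None:
        self._buffer = bytearray(size)   # the only allocation ever made
        self._view = memoryview(self._buffer)
        self._alignment = alignment
        self._offset = 0

    def allocate(self, nbytes: int) -> memoryview:
        """Return a writable slice of the buffer, or fail loudly if full."""
        # Round the current offset up to the next aligned boundary.
        start = -(-self._offset // self._alignment) * self._alignment
        end = start + nbytes
        if end > len(self._buffer):
            raise MemoryError(
                f"arena exhausted: need {end} bytes, have {len(self._buffer)}"
            )
        self._offset = end
        return self._view[start:end]

    @property
    def used(self) -> int:
        """Bytes consumed so far, including alignment padding."""
        return self._offset


if __name__ == "__main__":
    arena = Arena(size=1024)            # hypothetical 1 KB budget
    activations = arena.allocate(300)   # scratch space for one layer
    scratch = arena.allocate(100)       # starts at the next 16-byte boundary
    activations[0] = 42                 # writes land in the shared buffer
    print("bytes used:", arena.used)
    try:
        arena.allocate(2000)
    except MemoryError as err:
        print("refused:", err)
```

Because the budget is fixed, running out of memory shows up as an explicit error during development rather than as unpredictable behavior in the field.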

There also are broader limitations beyond simply restricting model sizes. TensorFlow Lite for Microcontrollers doesn’t support the full set of TensorFlow operations. And it must be “ported” to individual devices, meaning that at any given time there may be a limited set of supported processors.
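
The effect of a restricted operation set can be illustrated with a toy interpreter that walks a list of operations and looks each one up in a table of available kernels. This is a sketch of the general interpreter pattern only, not the actual TensorFlow Lite for Microcontrollers API. The op names and the `run_model` function are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Kernel = Callable[[Vector, Vector], Vector]

# Only the kernels registered here exist on this hypothetical device.
# Leaving an op out saves code space but makes models that use it unrunnable.
KERNELS: Dict[str, Kernel] = {
    "SCALE": lambda x, p: [v * p[0] for v in x],
    "ADD_BIAS": lambda x, p: [v + b for v, b in zip(x, p)],
    "RELU": lambda x, p: [max(0.0, v) for v in x],
}


def run_model(ops: List[Tuple[str, Vector]], inputs: Vector) -> Vector:
    """Interpret the model: dispatch each (op name, parameters) in order."""
    x = list(inputs)
    for name, params in ops:
        kernel = KERNELS.get(name)
        if kernel is None:
            raise NotImplementedError(f"op {name!r} is not supported here")
        x = kernel(x, params)
    return x


if __name__ == "__main__":
    model = [("SCALE", [2.0]), ("ADD_BIAS", [-1.0, -5.0]), ("RELU", [])]
    print(run_model(model, [1.0, 2.0]))
    try:
        run_model([("SOFTMAX", [])], [1.0, 2.0])
    except NotImplementedError as err:
        print("rejected:", err)
```

A model that relies on an operation missing from the table has to be reworked or run elsewhere, which is the practical cost of keeping the engine small.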

It is possible to compile all the way down to native code, which will be more efficient for both performance and power. “If you just compile everything onto the controller, even at the bare metal level, it will be more energy-efficient,” said Anoop Saha, market development manager at Siemens EDA.

The question now is how small this can go. “More people are trying to get the runtime out, especially for very low-resource environments, because it takes up so many kilobytes, and then you don’t have enough left for your networking,” observed Verhelst, noting that compilation may be complicated due to the fact that each system may have its own compile flow. “People have tried to make a dedicated compile chain that you pre-compile, download on your system and just execute that predefined flow. But that often requires quite a lot of work on the compilation tool chain.”

There are efforts to find ways to standardize such flows, which will remove some of the friction from the compiled approach. “You can re-use part of existing compiler tool chains and plug in a custom back end,” she explained.

For models originally developed on a framework other than TensorFlow, Open Neural Network Exchange (ONNX) formats can be used to adapt the model to TensorFlow Lite for Microcontrollers.

Fig. 1: Different versions of TensorFlow operate on different systems. ONNX can be used as a bridge from other training frameworks. Native code is also an option for edge implementations, regardless of resource constraints. Source: Bryon Moyer/Semiconductor Engineering

Focus on inference
While many ML engines allow for both inference and training, particularly in the cloud, most edge devices do only inference. That’s a natural fit for systems that will be executing their AI mission rather than participating in the development of the AI model.

That means the models for edge systems would be developed in the cloud and then deployed at the edge, although for some of these smaller systems it may be possible to run the training on a well-appointed desktop system. But that assumes a process where a model is trained once and then executed. There is also a notion of incremental learning, where an edge system can do its normal inference work while taking advantage of the data that it sees to improve its own algorithms.

Given that these systems will likely be too small to handle that training aspect, they may need a connection to a beefier computer (or set of computers), either on-premise or in the cloud, to handle that training duty.

“The practical implication of this is that you will end up with good models that work well on the edge, but can’t necessarily adapt to new data without having some form of cloud connectivity,” said Infineon’s Harsha. “The expectation is that of a hybrid approach, where models will be running on the edge, with a small subset of data being sent to the cloud for continuous refinement of the model throughout the lifecycle of the product.”

Without incremental training capabilities, such systems would need to rely on over-the-air (OTA) updates for model improvement. In systems that can handle files, that could mean simply replacing one model with another. It becomes more complicated, however, if the model is compiled into the engine as a byte array. Then the entire program would need to be replaced — not just the model data.

Developing applications in the TinyML world
It’s well known that a model created for use in the data center will undergo some compromise when being ported down to the edge. At the very least, floating-point representations are more likely to be quantized into some integer format. And it’s well accepted that such steps may reduce model accuracy. Engineers often work hard to retrain and iterate to recover or minimize that lost accuracy.
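
As a minimal sketch of what quantization does to the numbers, the Python below maps floating-point weights onto signed 8-bit integers using a scale and a zero point, then maps them back to expose the rounding error. This is one common affine scheme written from scratch for illustration. The weights are made up, and real toolchains add calibration and other options not shown here.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine-quantize floats to int8; returns (codes, scale, zero_point)."""
    # The represented range must include 0.0 so that zero maps exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0   # avoid a zero scale for all-zero input
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.4, 1.5]   # made-up example weights
    codes, scale, zp = quantize_int8(weights)
    restored = dequantize_int8(codes, scale, zp)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("int8 codes:", codes)
    print("scale:", scale, "zero point:", zp)
    print("worst-case round-trip error:", worst)
```

Each weight now occupies one byte instead of four, at the cost of a round-trip error of up to half the scale per value. That error is the source of the accuracy loss engineers then try to recover.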

The same is likely to be true to an even greater extent with TinyML-sized systems. That raises an important question: how is one to know which applications can be reasonably handled in a small embedded system?

“You’re not going to run some massive machine-learning model on a tiny little microcontroller,” observed Synopsys’ Watson. “So there are things that it can do and things that it can’t.”

A working assumption is that developers will not be creating models from scratch. “That’s a data-scientist job,” said Campbell. “And they’re not using embedded devices. They’re using big workstations to design the model, train it, and get the accuracy that they’re looking for.”

The challenge is adapting known models. In the TinyML world those models often are shared openly, so there may be models available off-the-shelf.

Nevertheless, just because one person got a model to work for their constrained system doesn’t mean that it will work for everyone’s. Profiling tools allow developers to understand the resource costs of a model. Knowing the capabilities of the chosen processor and the available memory then helps to determine whether a given model can work on a given system.

“You usually have an idea of the network that you want to run,” said Campbell. “You can profile it and figure out how long it takes. You also know the particular device you’ve chosen, so you can make an estimate and say, ‘If I run my SoC at this many megahertz, I will be able to run this many inferences per second of this particular graph.’”

There are other considerations, as well. “Besides performance, there’s also memory size,” added Watson. “You’ll be able to estimate how much memory the model needs.”
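
A first-pass version of these estimates is simple arithmetic. The sketch below assumes a model can be summarized by its multiply-accumulate (MAC) count and its memory needs, and that the processor sustains some fixed number of MACs per clock cycle. Every name and number here is a hypothetical placeholder; real figures come from profiling the actual model on the actual device.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    macs_per_inference: int   # multiply-accumulates for one forward pass
    weights_bytes: int        # read-only model data (flash)
    scratch_bytes: int        # working memory for activations (RAM)


@dataclass
class Device:
    clock_hz: int
    macs_per_cycle: float     # sustained rate, not peak
    flash_bytes: int
    ram_bytes: int


def inferences_per_second(model: ModelProfile, device: Device) -> float:
    """Throughput estimate that ignores memory stalls and other overhead."""
    return device.clock_hz * device.macs_per_cycle / model.macs_per_inference


def fits(model: ModelProfile, device: Device, other_flash: int, other_ram: int) -> bool:
    """Check memory with the rest of the firmware's needs included."""
    return (model.weights_bytes + other_flash <= device.flash_bytes
            and model.scratch_bytes + other_ram <= device.ram_bytes)


if __name__ == "__main__":
    model = ModelProfile(macs_per_inference=2_000_000,
                         weights_bytes=60_000, scratch_bytes=20_000)
    mcu = Device(clock_hz=80_000_000, macs_per_cycle=0.5,
                 flash_bytes=256 * 1024, ram_bytes=64 * 1024)
    print(f"{inferences_per_second(model, mcu):.1f} inferences/s")
    print("fits:", fits(model, mcu, other_flash=150_000, other_ram=40_000))
```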

That determination must be made in full view of all of the software that will run on the system, not just the ML. Other system necessities, like security, can themselves tax the resources, so it all has to fit in there together.

Proliferating and democratizing
The range of applications that a TinyML system can handle is growing. Part of that growth comes from improved ways of doing the computing. Some comes — and will continue to come — from the ongoing increase in computing power available at this level, thanks to Moore’s Law and more-than-Moore efforts.

“Because of Moore’s Law, every couple of years we have more powerful chips,” observed Saha. “So if you just wait it out, we’ll be able to run an algorithm on tiny devices.”

At present, typical applications include activity monitoring, gesture recognition, preventive maintenance in motors and HVAC systems, and basic audio events like identifying a wake word or a specific sound like breaking glass or a baby’s cry.

MLPerf, the organization developing ML benchmarks, has a “Tiny” group that has split off to define benchmarks that are appropriate for TinyML-sized systems. The prior benchmarks require far too much in the way of resources, so further work should make it possible to compare ML results in these extremely small systems. The first benchmarks are just now coming out.

There are numerous aspects to the work being done under the TinyML umbrella. In what can be weekly meetings, members present work that can consist of new tools to automate tasks, new optimization approaches, and new optimized models.

Part of the motivation of the entire movement comes from the realization that, for developing ML, data is king, and whoever has the best data can develop the best models. That gives companies like Google and Facebook outsized power to completely dominate the technology.

“That’s what Big AI creates,” said Gousev. “They create this inequity gap. The big guys own the data, they have their servers, and so they basically control the world.”

The TinyML effort is partly about democratizing AI. It brings tools, knowledge, and resources to small companies, and even individual developers in a way that gives them a fighting chance against the AI Goliaths.

“Instead of the big guys owning the whole thing, the guy who runs the farm, or the guy who runs the supermarket, or the guy who runs the factory — they do it locally,” said Gousev. “Anyone who doesn’t have a background in data science or special math training or special programming can develop their own stuff.”

Having lots of Davids also means market fragmentation, as a host of small companies battle it out. That is interesting in that it suggests a real free market, with numerous players, none of which can dominate by sheer imposition of size. Each company has to win based on its merits, and developers will have a range of tools and platforms from which to choose.

But fragmentation has its downsides. “The problem with TinyML is that the market is so fragmented that it cannot be economically viable to build a single IoT-specific chip,” Saha pointed out. According to him, that may mean composing systems out of smaller sub-system chips for things like the radio or security.

When it comes to ML, he notes, “The better choice is to build an accelerator, which will be cheaper, and the chances of success are higher and better as well.”

While TinyML may sound like a hobbyist’s platform for use after hours, a number of big companies have people working full-time on TinyML projects. It’s very much fueled by professionals, not just tinkerers.

“The next phase is going to be scale,” predicted Gousev. “And to scale, you need to have killer apps.”

What’s next
While there’s plenty of activity developing basic technical capabilities today, the organization shows no sign of declaring “mission accomplished” anytime soon. As the technology matures, attention is likely to shift toward business and additional ways to leverage what has been learned.

“We have algorithm developers, ecosystem people, software people, application people, so the center of gravity will be moving from the technology into the business application side of things,” said Gousev.

Even within technology, however, work will remain. “There are more than enough challenges,” said Verhelst. “And the push will always be to lower and lower power.”

Once the milliwatt power range has been conquered, there will be the 100nW range, followed by the 10nW range, with no end in sight. Form factors also are likely to shrink, tightening the constraints. And new applications will bring new demands.

All of this is likely to keep TinyML work going for many years, with some real impact. “We fully expect that TinyML is going to shape how next generation edge devices will be architected,” said Harsha.

Huw Davies @Trameto.com says:

At operating powers of milliwatts and below, TinyML represents a huge opportunity for battery-less operation using micro energy harvesting and novel PMIC devices.
