Machine Learning Inferencing Moves To Mobile Devices

TinyML movement pushes high-performance compute into much smaller devices.


It may sound retro for a developer with access to hyperscale data centers to discuss apps that can be measured in kilobytes, but the emphasis increasingly is on small, highly capable devices.

In fact, Google staff research engineer Pete Warden points to a new app that uses less than 100 kilobytes of RAM and storage, creates an inference model smaller than 20KB, and can process up to 10 million arithmetic operations per second. Warden, the technical lead of Google’s TensorFlow Mobile team, is trying to push use of machine learning (ML) inferencing down into the middle and low ends of the smartphone market while the market focuses its race for inference superiority on premium smartphones.

Apple changed the landscape of the smartphone market two years ago with the announcement of a neural-network accelerator in the iPhone X’s A11 chipset that is capable of 600 billion operations per second. That essentially pushed machine learning inferencing out of the data center by allowing some of those functions to be completed on the device. The company promised greater security, because less data passed outside the user’s control on its way to the cloud. And it delivered better performance for the FaceID authentication system and other ML apps.

“By running primarily on-device you’re able to show really low latency, which means a really good user experience because you don’t have that round-trip delay,” Warden said.

And this has set off a huge scramble across the semiconductor industry to move more AI/ML inferencing closer to the user in an as-yet poorly defined space known as the edge.

“Edge processing is one of the areas that has people very excited over machine learning,” according to Jeff Miller, product marketing manager at Mentor, a Siemens Business. “There are a lot of interesting approaches people take to chips designed as accelerators, but we’re still in the early stages of development. It can be very computationally expensive, especially if it’s running on general-purpose hardware. It can run on a CPU, but the evolution of this is to make it run at lower power and lower infrastructure requirements in the field.”

Demand is likely to push inference accelerators through even the low end of the smartphone market, and will spread in fits and starts into other markets, depending on requirements of the device or the market, according to the Linley Group’s Guide to Processors for Deep Learning. More than half of new smartphones will include a deep learning accelerator (DLA) by 2023, when the market for all forms of DLA could reach $10 billion, compared to $3 billion in 2018.
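As a rough check on what that forecast implies, the growth from $3 billion in 2018 to $10 billion in 2023 can be converted into a compound annual growth rate. The sketch below is illustrative arithmetic only; the function name `cagr` is an assumption for this example and is not taken from the Linley Group report.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures quoted above: $3 billion in 2018, $10 billion forecast for 2023.
rate = cagr(3.0, 10.0, 2023 - 2018)
print(f"Implied annual growth: {rate:.1%}")  # roughly 27% per year
```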

“But there’s nothing to prevent developers from creating micro versions of inference models that include just the key functions from the main application, or to keep them from using very compact code to add new functions that could have a huge impact,” said Google’s Warden. “Think how much more useful a sensor or some other inaccessible device would be if you could add voice response with a specific, limited vocabulary to make the interface more convenient.”

Warden made his case for a more minimalist approach to some aspects of machine learning in a blog post. He and a long list of high-profile commercial ML developers and academics expanded on and lobbied for the idea at two separate conferences, ScaledML and TinyML, that ran in consecutive weeks in March within miles of each other. The recent Design Automation Conference got in on the act with a session on TinyML that included talks by Chris Rowen of Babblelabs, Manar El-Chammas of Mythic, and Scott Hanson of Ambiq Micro, who spoke about their experiences trying to implement TinyML.

Machine learning (ML) traditionally has been considered too resource-intensive to even attempt with embedded systems or IoT devices powered by microcontrollers, DSPs or other high-efficiency platforms, according to Tom Hackenberg, principal analyst for embedded systems for IHS Markit.

“A lot of people simply overestimate how hard inference really is,” Hackenberg said. “Even in large datacenters, 90% to 95% of servers running ML inference models do it without the help of a Google TPU or other purpose-built inference accelerator.”

The majority of popular machine-learning applications revolve to some degree around image, object or voice recognition, functions that custom-designed inference ASICs do quite well.

“Interest in AI has been growing enough that a lot of people have started doing it on non-optimized machines [with a] subsystem that does multiply-and-accumulate for apps with lower graphics requirements at a quarter the cost of a GPU,” Hackenberg said.
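For readers unfamiliar with the term, multiply-and-accumulate (MAC) is the basic operation behind most neural-network inference: each output is a running sum of weight-times-input products. The following is a minimal sketch in plain Python; the names `mac_dot` and `dense_layer` are hypothetical and chosen only for illustration, and real hardware performs the same loop in dedicated parallel units.

```python
def mac_dot(weights, inputs):
    """One output value: repeatedly multiply a weight by an input and accumulate."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += w * x  # a single multiply-and-accumulate step
    return acc

def dense_layer(weight_rows, inputs):
    """A fully connected layer is one MAC loop per output neuron."""
    return [mac_dot(row, inputs) for row in weight_rows]

outputs = dense_layer([[1, 2], [3, 4]], [5, 6])
print(outputs)  # [17, 39]
```

A layer with N inputs and M outputs needs N x M such steps per inference, which is why a subsystem that does little else can stand in for a more expensive GPU on small models.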

“We don’t know yet what most of these chipsets will look like on their own,” according to Carlos Macian, senior director for AI strategy and products for eSilicon. “In datacenter products, at least, power consumption is the alpha and the omega of what you need to know. The question of heat follows that. Most of these products have been in the pipeline and are still on the way to tapeout, so we haven’t seen them yet.”

Running ML models on scalar hardware is rare, however, because there is a disconnect between the process of designing the hardware and the development of new engineering skills.

“It’s about more than the interface,” according to Steve Woo, fellow and distinguished inventor at Rambus.

Growing numbers
Despite demand for smartphones and ML services aimed at consumers, the need to use devices on or near the edge to filter, pre-process and reduce the volume of data flowing to the cloud will remain a critical issue.

“Even conservative estimates say data volumes are doubling about every two years,” Woo said. “No other technology curve is close. The cloud processing model we’ve all gotten used to can swamp networks from a volume and from a cost standpoint. So the only real recourse is to process that data closer to where the device is located—in the phone or in the camera, or perhaps edge computing datacenters.”

The market for smartphones is large, but the global installed base is only about 5.1 billion devices, according to GSMA. Adding IoT devices that are also good candidates for inference adds another 22 billion devices and growing to the list of potential upgrade targets, according to Strategy Analytics.

The big opportunity is in devices that skew even further down the scale of processing power: embedded or standalone devices powered by microcontrollers, DSPs or custom SoCs. Warden estimates the installed base of such devices is currently about 150 billion, with another 40 billion shipping every year. MCUs may not help sell a lot of newly designed accelerators. Their limited resources would isolate MCU owners to a certain extent, and create a separation between those who run ML on MCUs and those running it on accelerators.

“One metric we’re seeing is increased variation in design,” said Mentor’s Miller. “There is a big trend toward seeing IC designs for specific markets or use cases or product lines. And in more and more of our interactions, we see systems companies looking to field an IoT application or service. They’re not just looking into things to sell chips. These things take a while to come to market, and we’re just seeing the leading edge, so it’s likely to be going.”

MCUs don’t have the graphics-processing muscle to execute a standalone inference model, even after it has been fully trained, quantized and otherwise compressed. Given the right set of tasks, such as analyzing numerical or other text-based information rather than data that is heavily graphical or time-sensitive, MCUs and DSPs usually can handle the pressure well, although not necessarily at very high levels of performance.
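Quantization, mentioned above, is one of the main ways a trained model is shrunk to fit a small device. The sketch below shows one common variant, symmetric 8-bit quantization with a single scale factor, which stores each weight in one byte instead of the four bytes of a 32-bit float. It is an illustrative example under that assumption, not the method used by any product named in this article, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original.
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f}")
```

The cost of the smaller footprint is a bounded rounding error per weight, which is often acceptable for tasks such as keyword spotting.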

“In devices like earbuds, where your device may be listening for wakeup words, the most important thing to the designer is to find the lowest-power architecture,” said Dave Pursley, product manager for HLS at Cadence. “That chip will spend most of its time powered down, but that one small part has to be powered up all the time listening for command words. If the list is too long or you’re not careful about space when you’re optimizing the design, you lose on area and power, which is just what you don’t want.”

Skills shortage
Finding people capable of doing this isn’t always easy, though. “There is a real shortage of engineering skills to build and program these solutions right now,” IHS’ Hackenberg said. “That’s one reason we sometimes recommend buying Nvidia, which is expensive, but can be cost-efficient because it saves time. And because it’s a platform, you get development tools, instruction, everything you need to make it work.”

Coding for small devices isn’t trickier than usual. But those new to machine learning sometimes misunderstand very basic things, such as where code is running now and where it will go if it has to respond to anything.

“If you have software on a device listening for a wake-word, you’re on-device,” Warden said. “But if it’s set to send things to the cloud when something interesting happens—like asking what to do when it hears a wake word—you can actually enable more cloud interaction rather than focus on running on device. That’s a pattern we actually see very often.”
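The pattern Warden describes can be sketched as a simple gate: a small detector runs on every audio frame on the device, and only frames that trigger it are forwarded to the cloud. The code below is a toy illustration under stated assumptions. The "frames" are text strings rather than audio, `detect_wake_word` is a stand-in for a real on-device model, and `send_to_cloud` is a hypothetical callback, not a real API.

```python
def detect_wake_word(frame, wake_word="hello"):
    """Stand-in for a tiny on-device model; here just a substring match."""
    return wake_word in frame.lower()

def process_stream(frames, send_to_cloud):
    """Check every frame locally; forward only the frames that trigger the detector."""
    sent = 0
    for frame in frames:
        if detect_wake_word(frame):
            send_to_cloud(frame)  # the only point where data leaves the device
            sent += 1
    return sent

uploaded = []
frames = ["background noise", "Hello, what's the weather?", "more noise"]
sent = process_stream(frames, uploaded.append)
print(sent, uploaded)
```

In this arrangement the detection is on-device, but everything after the trigger still depends on the cloud, which is the distinction the quote draws.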
