Edge-Inference Architectures Proliferate

What makes one AI system better than another depends on a lot of different factors, including some that aren’t entirely clear.


Part one of two. The second part will dive into basic architectural characteristics.

The last year has seen a vast array of announcements of new machine-learning (ML) architectures for edge inference. Unburdened by the need to support training, but tasked with low latency, the devices exhibit extremely varied approaches to ML inference.

“Architecture is changing both in the computer architecture and the actual network architectures and topologies,” said Suhas Mitra, product marketing director for Tensilica AI products at Cadence.

Those changes are a reflection of both the limitations of scaling existing platforms and the explosion in data that needs to be processed, stored and retrieved. “General-purpose architectures have thrived and are very successful,” said Avi Baum, co-founder and CTO of Hailo. “But they’ve reached a limit.”

The new offerings exhibit a wide range of structure, technology, and optimization goals. All must be gentle on power, but some target wired devices while others target battery-powered devices, giving different power/performance targets. While no single architecture is expected to solve every problem, the industry is in a phase of proliferation, not consolidation. It will be a while before the dust settles on the preferred architectures.

ML networks
To make sense of the wide range of offerings, it’s important to bear in mind some fundamental distinctions in the ML world. There are at least three concepts that must be distinguished — an ML network, an ML model, and a hardware platform intended for implementing ML models.

An ML network is an arrangement of layers and nodes, along with their connections. Researchers spend a lot of effort coming up with new networks that might work better than older ones. These networks usually have names like ResNet, MobileNet, and YOLO.

The network establishes the nature of the work required to make decisions. Most developers and researchers, for example, have found that performance is limited by memory accesses — the so-called “memory wall.” By contrast, Google has said that its tinyML network is compute-bound rather than memory-bound.

“We don’t need more memory accesses,” said Pete Warden, technical lead of TensorFlow Micro at Google, during a recent presentation. “We load a few activations and weight values at a time, and then do a lot of different combinations of multiply/adds on them in registers. So that’s not a great fit for traditional hardware because most problems tend to be memory-bound.”

There are two broad categories of network being commercialized today. The most prevalent is the “artificial neural network,” or ANN. These are the familiar network styles, such as CNNs and RNNs, where all data moves through the network. “The convolutional neural network is the workhorse of a lot of what we see in the AI space today,” said Kris Ardis, executive director of the Micros, Security and Software business unit at Maxim Integrated.

The second category is neuromorphic. Spiking neural networks (SNNs) focus more on changes to an existing state than on constantly reprocessing the entire state. “People believe that SNNs are the next step because they’re going to need to get hundreds, if not thousands, of times better than they are today,” said Ron Lowman, strategic marketing manager, IP at Synopsys. “Being able to record previous knowledge is what spiking neural networks are trying to do, whereas CNNs massively process all that data constantly to make sure they have the correct answer each time.”
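As a rough illustration of the distinction (a generic sketch in Python, not any vendor's implementation), a conventional ANN layer recomputes every output from every input on every frame, while a spiking neuron carries state forward and only does work when incoming events arrive:

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))   # weights shared by both styles below

def ann_layer(x):
    # Conventional ANN layer: every output is recomputed from every input, every single frame.
    return np.maximum(w @ x, 0.0)

def snn_step(v, spikes_in, leak=0.9, threshold=1.0):
    # Leaky integrate-and-fire step: state decays, incoming spikes (events) add charge,
    # and a neuron emits its own spike only when its membrane potential crosses threshold.
    v = leak * v + w @ spikes_in
    fired = v >= threshold
    v = np.where(fired, 0.0, v)          # reset neurons that fired
    return v, fired.astype(float)

v = np.zeros(4)
for t in range(5):
    frame = rng.random(8)                     # dense input: the ANN processes all of it
    events = (frame < 0.1).astype(float)      # sparse events: the SNN only reacts to these
    dense_out = ann_layer(frame)
    v, spikes_out = snn_step(v, events)

When the event stream is sparse, most of the multiply work in the spiking path can simply be skipped, which is the efficiency claim behind the approach.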

At present, BrainChip and GrAI Matter are among the few companies offering an SNN. “Akida follows neuromorphic design principles, which distribute all the computation so that it is spread across multiple neural processing units (NPUs),” said Anil Manker, CEO and co-founder of BrainChip. “Each NPU has its own memory and computation media.”

Which network is used has a profound impact on performance. This is why benchmarks such as those developed by MLPerf specify a network like ResNet-50, so that performance comparisons aren’t confounded by different networks. The challenge is that benchmarks take a while to develop, so at this point they tend to use smaller networks than those that are commercially attractive.

Companies with new hardware want to prove they can meet their customers’ performance needs, and benchmarks would be one way to do that, allowing apples-to-apples comparisons between architectures. But running benchmarks takes time, so companies often focus their testing work on networks that their customers may want to use, which may be different from networks used for benchmarks.

“Unlike a lot of edge AI inference chips in a similar cost and power category, we are optimizing for large models and megapixel images that the customer actually wants to run, not a 2.4-by-2.4 MobileNet,” explained Cheng Wang, senior vice president, software architecture engineering at Flex Logix.

That means many of the new devices won’t have official benchmark numbers. It’s not necessarily that the companies are intentionally steering clear of the benchmarks. In some cases, given limited resources, they’re prioritizing other networks, coming back to the official benchmarks if and when they have time.

ML models and underlying hardware — and software
Networks are like the scaffolds on which an actual design can be implemented. Once we take such a network and train it for a specific task, we end up with a model. A model is a network with weights. Whereas a network may be created by a research team after lots of effort, a model is created by a development team designing a specific application. A single network can be used for many different models.
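As a concrete illustration of the distinction, here is a hedged sketch assuming a Keras-style workflow (the network name comes from this article; the class count, input size, and training data are placeholders):

import tensorflow as tf

# The "network": a published architecture (here MobileNetV2), instantiated with random weights.
network = tf.keras.applications.MobileNetV2(weights=None, classes=10, input_shape=(96, 96, 3))

# Training on application-specific data is what turns the network into a "model":
# the same structure, now carrying learned weights.
network.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# network.fit(train_images, train_labels, epochs=10)   # application data, not shown here
model = network   # after training, this network-plus-weights is the deployable model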

Models typically have been built generically, using floating-point math, after which they are adapted for a specific hardware platform. That adaptation may involve moving from floating-point to integer math of some level of precision, and it also may include compilation optimization steps that adapt the model to the hardware. So in the space between the generic model and the hardware platform, you end up with a model that’s been optimized for one kind of hardware.
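One common form of that adaptation is post-training quantization. A minimal sketch using TensorFlow Lite, one toolchain among many (the stand-in model and the representative-data generator are placeholders you would replace with real ones):

import tensorflow as tf

# Stand-in for a trained floating-point model.
model = tf.keras.applications.MobileNetV2(weights=None, classes=10, input_shape=(96, 96, 3))

def representative_data():
    # A small sample of real inputs, used to calibrate activation ranges. Random placeholder here.
    for _ in range(100):
        yield [tf.random.uniform((1, 96, 96, 3))]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8      # produce a full-integer model for int8-only hardware
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()

Vendor toolchains for the devices discussed here perform analogous steps, often with additional hardware-specific graph optimizations.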

A key consideration for anyone developing a hardware platform is the math precision to be used. All other things being equal, more precision — and floating-point if possible — makes for better accuracy. But for many applications, there is such a thing as “enough accuracy.” More hardware precision means more circuitry, which chews up more silicon area and increases cost. If all of that added cost raises a model’s accuracy only from 97.2% to 97.4%, it may not be worth it.

Google asserts that 8 bits is enough for practical purposes. “We’ve got lots of evidence that 8 bits is plenty,” said Warden. In fact, even smaller representations can work. But he noted that FPGAs are really the only hardware that can implement arbitrary precision. Everything else defaults to 8 bits (or a multiple thereof). That means most software efforts also have focused on 8 bits, making it harder to optimize a model for something smaller.
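The arithmetic behind 8-bit quantization is a simple affine mapping from floating-point values to integers. A generic NumPy sketch, not tied to any particular device:

import numpy as np

def quantize_int8(x):
    # Map float values to int8 using a per-tensor scale and zero point.
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(1000).astype(np.float32)
q, s, z = quantize_int8(w)
print("max abs error:", np.abs(dequantize(q, s, z) - w).max())  # typically small relative to the value range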

Historically, this approach has meant that a model, after being modified, might need to be retrained. This is because, although the model was based on a specific network, the network configuration can change during the optimizations. Retraining then helps to recover some of the accuracy that may have been lost during adaptation.

Some of that rework can be eliminated by training in a manner that’s aware of the underlying hardware. That can lead directly to the implemented model, bypassing the generic model. But this is a function of the design tools available for a given hardware platform.
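Quantization-aware training is one widely used example of training with the target precision in mind. A hedged sketch using the TensorFlow Model Optimization toolkit (the tiny network is a placeholder; vendors of the devices discussed here typically ship their own hardware-aware flows):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Stand-in for the floating-point network chosen earlier.
network = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(96, 96, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the network so that training simulates int8 quantization in the forward pass.
q_aware = tfmot.quantization.keras.quantize_model(network)
q_aware.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# q_aware.fit(train_images, train_labels, epochs=10)   # application training data, not shown
# The result can then be converted to an integer model with little or no post-hoc retraining.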

Fig. 1: Well-researched networks are available as choices for an application. One is selected, and the training process creates a model. Selecting inference hardware and then optimizing the model for it creates a final production model for deployment on that hardware. The dashed line indicates a hardware-aware training flow, which can bypass some or all later optimizations. Source: Bryon Moyer/Semiconductor Engineering.

Design tools are likely to be dominant forces in this space. Historically, FPGA companies and divisions have died for want of good tools, and machine learning may well follow that same path. “How you use the hardware is really up to the software,” said Cadence’s Mitra. “And unless you have good software, you really cannot have a good play.”

Software informs the hardware. “We are architecting our devices in a software-centric manner, such that any data flow, any new network framework, is easily ported to the hardware,” said Ravi Annavajjhala, CEO of Deep Vision.

What is the “edge”?
One of the key considerations is where these chips will reside. ML long has focused on cloud implementation, where practically limitless resources can be mustered as needed. But the focus has been moving inexorably away from the cloud and to the edge.

“Machine learning has been very successful in the cloud, thanks to Google, Microsoft, and Facebook, but we believe that it needs to be complemented with similar capability closer to where the data is obtained,” said Baum.

That said, the word “edge” is overloaded. For some, particularly in the communications world, the edge means the edge of the core network, which is where the local networks start. In the cellular world, it might be the last stop before the wireless link. In these cases, there is still more network between the edge and the end-user equipment.

Alternatively, the edge refers to the true edge of the network, which stops at the end-user equipment.

There’s also an intermediate view of something being at the edge, where the term “on-premise” or “local” might be considered an alternative to “edge.” The point is that, even if it doesn’t refer to the end equipment, data doesn’t have to leave the local network for inference. Which definition we use affects the power requirements of any machine-learning engines that will operate “at the edge.” We focus on these latter two in this article.

Power at the edge
The data center is often assumed to have unlimited energy available for computation. That isn’t quite true. Saving energy in the data center matters, but for those applications performance remains the primary goal.

The edge is assumed to be more energy-conscious. If it refers to a local server, then that server may draw energy from the power grid, making low power less critical. Other end-user equipment may be battery-powered, and so energy comes at a premium. Here, a delicate balance is required between the performance necessary to complete a task quickly enough and the energy it takes to do so.

One benefit of locating inference at the edge lies in saving communication power. If data must be communicated to the cloud for processing, then an enormous amount of energy is expended in transmitting that data. “Even with Bluetooth with its lowest power, you consume 3 to 10 milliamps average, while computing the same data on-device instead could cost you hundreds of microamps,” said Arabi. By doing the work locally, that energy can be saved.
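A back-of-envelope comparison using the currents quoted above (the supply voltage and the specific values chosen within those ranges are assumptions, purely for illustration):

# Rough energy comparison for one hour of continuous operation at an assumed 3 V supply.
volts = 3.0
radio_ma = 5.0          # mid-point of the 3-10 mA Bluetooth range quoted above
on_device_ma = 0.3      # "hundreds of microamps" for local inference

radio_mwh = radio_ma * volts          # ~15 mWh per hour spent streaming data off-device
on_device_mwh = on_device_ma * volts  # ~0.9 mWh per hour computing locally
print(f"radio: {radio_mwh:.1f} mWh/h, local inference: {on_device_mwh:.1f} mWh/h, "
      f"roughly {radio_mwh / on_device_mwh:.0f}x difference")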

That means there are several possible targets for energy and efficiency, depending on the location of the inference. If the inference is done on a battery-powered device, then the hardware must be especially miserly. If a local server is used for the heavy lifting, exchanging data over one low-latency hop with battery-powered equipment, then the engine gets to run on line power. The following graphic provides some power ranges for various grades of “edge.”

Fig. 2: Different power levels for different edge devices. Source: Deep Vision

Performance at the edge
There’s also a big performance distinction between the cloud and the edge. The cloud often runs giant batches of images, for example. That makes an architecture practical even if it is inefficient at switching from one layer of the network to another, because thousands of images per layer amortize the layer-swap cost. It may take a long time to get the results, but a thousand different results arrive pretty much at the same time.

That doesn’t work for most edge applications. When streaming video, for example, you get one frame to process completely before you have to start over with the next frame. You don’t get the luxury of running tons of frames from different streams a little at a time. You have to be efficient in processing one stream.

This points to latency as the measure of performance, and latency has to be low for edge applications. This is often indicated by the phrase “batch=1.” A hardware platform optimized for batch=1 is optimized for the latency of a single inference rather than for throughput over a large batch, which is effectively a given requirement for edge applications.
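A toy model of that amortization argument, with all costs invented for illustration: when switching a shared engine from one layer’s configuration to the next carries a fixed overhead, large batches hide that overhead but stretch the time until any single result is ready, while batch=1 pays the overhead on every frame but returns each result as soon as possible.

def run(batch, n_layers=50, swap_ms=2.0, per_image_ms=0.1):
    # Time to push 'batch' images through all layers when each layer swap costs swap_ms.
    total = n_layers * (swap_ms + batch * per_image_ms)
    latency = total                        # the first result is ready only when the whole batch finishes
    throughput = batch / total * 1000.0    # images per second
    return latency, throughput

for batch in (1, 32, 1024):
    latency, throughput = run(batch)
    print(f"batch={batch:5d}  latency={latency:8.1f} ms  throughput={throughput:7.1f} img/s")

Larger batches win on throughput, which suits the cloud; batch=1 minimizes the time to each individual result, which is what a live video frame needs.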

That said, latency is a function of the application. The hardware plays a part in determining that latency, but latency isn’t a feature of the hardware alone. So it can’t be used to compare hardware unless the comparison is for the exact same application and model.

There is a tradeoff between performance and power, and the trick is not to compromise too much in moving to the edge. “We can run some of the largest neural networks at the edge that people currently run today only in the data center,” said Steve Teig, founder and CEO of Perceive. “At data center-level accuracy, we run something like 20 mW, not hundreds of watts.”

Edge tasks overwhelmingly focus on inference
The other characteristic tied closely to edge vs. cloud is the machine-learning task being performed. For the most part, training is done in the cloud. Training creates a hardware requirement for back-propagation and weight updates, which is not necessary for inference.

While inference can be done in the cloud, along with training, edge implementations are almost exclusively for inference. There are some architectures that are capable of incremental learning, where the last layer of a model can be retrained at the edge, but that’s less common.

Looking through the new architectures reveals several different themes. At this point, there’s no clear “right” and “wrong.” Each architecture reflects choices. Over time, some of these ideas may lose out to others, so “better” and “worse” may well emerge. But it’s too early for that, especially considering that few of these devices are available in commercial production quantities today.

More than one way to handle an application
Applications often are bundled into broad categories like “vision.” There are a couple of ways in which this can be misleading. First, not all vision applications are alike. While they may all involve convolution, the kind of processing done to identify an animal in a picture (even if unlikely at the edge) is very different from the processing in an autonomous vehicle, with multiple video and other streams fused to make decisions.

“The original task was classification, which was not a very useful application,” said Gordon Cooper, product marketing manager for ARC EV processors at Synopsys. “Then there’s facial detection, or pedestrian detection, or lane-line detection, or street-light detection. And then there’s a whole different class of image-quality improvement, where the neural network cleans up an image.”

It’s also possible that one kind of application may be morphed into another kind for convenience. There is much more vision-oriented hardware available than there is for applications like language and sound. Some companies take non-vision applications and convert them into vision. “You can take what is traditionally one-dimensional, like an audio waveform or radar waveform, and map it into a two-dimensional image,” said Cooper. “Then you feed it through a CNN.” So broad categories are useful, but not definitive.
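A minimal sketch of that remapping, assuming SciPy for the spectrogram and a generic Keras CNN (the layer sizes, class count, and placeholder audio are all invented; no vendor pipeline is implied):

import numpy as np
import tensorflow as tf
from scipy import signal

# Turn a 1-D audio waveform into a 2-D time-frequency "image".
fs = 16000
waveform = np.random.randn(fs)                        # 1 second of placeholder audio
freqs, times, spec = signal.spectrogram(waveform, fs=fs, nperseg=256)
image = np.log1p(spec)[np.newaxis, :, :, np.newaxis]  # shape: (1, freq_bins, time_steps, 1)

# Feed the 2-D representation to an ordinary image-style CNN.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=image.shape[1:]),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),   # e.g., four keyword classes
])
prediction = cnn(image.astype(np.float32))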

As to the prevalence of edge vision applications in particular, Cooper noted the efficiency of CNNs. “One of the reasons why vision tends to be more targeted for edge applications is because you don’t have the bandwidth problems that you have with speech and audio,” he said. “LSTMs or RNNs need a significant amount of bandwidth.”

That said, some of the architectures are intended to address many applications. “Although image processing is the obvious heavy lifting, we can support up to two cameras, up to 4k video, multiple microphones, and audio, speech, and stateful neural networks,” said Teig.

Fig. 3: An architecture for full-application use. In this case, there are separate image-processing and neural network-processing resources. Source: Perceive

Undisclosed
One surprisingly common theme among the announced architectures is that hardware details are not disclosed. This appears largely to be a competitive decision. These architectures reflect many different tradeoffs and ideas, and if a few specific hardware notions yield great success, the last thing their creators want is for those notions to be public. Patents help, and patents are legion in this space. But even so, keeping the competition in the dark is clearly a strategy for some companies.

The only way that can work, however, is if they have a good software story. The basic explanation is, “We don’t disclose hardware because designers don’t need to understand the hardware. The tools abstract that away, making the hardware details unimportant.” That places extra pressure on the software tools to deliver, since only the software can make the undisclosed hardware look good.

Conversely, clever disclosed hardware ideas may not pan out if the software tools don’t leverage them well. So disclosure isn’t necessarily a good or bad indicator. It’s also not all-or-nothing. Some companies disclose to a certain level, but not in detail. Or sometimes only key patented ideas are disclosed as a way of enticing a new user to give them a try.

SiMa.ai is a case in point. “We are doing literally nothing unique on silicon — no analog, no inline memory compute, no nothing whatsoever,” said Krishna Rangasayee, CEO of SiMa.ai. “We want to bring minimal risk to our silicon, so our innovation is less about process technology and silicon. It’s really architectural.” The software hides that architecture, however.

Fig. 4: A processor block diagram that doesn’t disclose much detail, relying on software tools to handle those details. Source: SiMa.ai

Dynamic vs. static engines
The classic, old-fashioned way to execute a neural-network model uses one or a few cores. The model is transformed into code, and at any given time, the core is executing code relating to one node or another, in one layer or another. In the limit, a single-core platform would execute every part of the network on that same core.

This, of course, gives poor performance, and the new architectures all propose improvements to this basic case. One major architectural consideration remains: Is the platform one where resources get re-used for different parts of the network, albeit more efficiently than the single-core model? Or is the entire model compiled by the design software and allocated to fixed resources in the chip?

Fig. 5: An example of an architecture where resources are allocated at compile time to different layers of the model, with execution flowing layer to layer within the array of resources. Source: Hailo

This distinction has major implications. By re-using hardware it may be possible to provision the chip with less hardware, lowering cost. It also may allow for more dynamic execution — decisions made on-the-fly that can affect the specific execution flow.
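As a conceptual toy (a caricature, not a description of any product listed here), the two styles can be thought of as temporal re-use of one shared engine versus compile-time assignment of layers to fixed tiles:

layers = ["conv1", "conv2", "conv3", "fc"]

def temporal_execution(layers):
    # One shared engine: reconfigure it for each layer in turn, re-using the same hardware.
    for layer in layers:
        print(f"reconfigure shared engine -> run {layer}")

def spatial_execution(layers, n_tiles=4):
    # Compile-time mapping: each layer is pinned to its own tile; data streams from tile to tile.
    mapping = {layer: f"tile{i % n_tiles}" for i, layer in enumerate(layers)}
    for layer, tile in mapping.items():
        print(f"{layer} runs on {tile} (fixed at compile time)")

temporal_execution(layers)
spatial_execution(layers)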

“It depends on the narrowness of what these folks are trying to do and whether they have one or two or three graphs — and they’ve tailored something for those graphs — or whether they’re trying to be all things to all people and be more programmable and less area-optimized and power-optimized,” said Cooper. The cost lies in reconfiguration and data movement.

Fig. 6: This architecture touts “polymorphism,” with different flows dynamically allocated according to the needs of each part of the calculation. Source: Deep Vision

Fully programmable or not
Another distinction that some device-makers will tout is “full programmability.” All of the architectures feature some form of programmability, so we’re not talking about “fixed-function” versus “programmable.” Instead, this usually alludes to how flexible the computing elements are.

If the engine is largely created out of MACs, those elements do one thing — execute multiply-accumulate operations. The architecture in which they’re embedded presumably will allow some sort of static or dynamic definition of what data an individual MAC will operate on, and that configurability can be considered programmability because the relationship isn’t fixed in the hardware.
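For reference, the “one thing” a MAC element does is trivially small; the programmability lies in which operands get routed to it, not in the operation itself. A minimal sketch:

def mac(acc, a, b):
    # Multiply-accumulate: the only operation a pure MAC element performs.
    return acc + a * b

# A dot product (the inner loop of convolutions and fully connected layers) is just repeated MACs.
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 3.0, 0.25]
acc = 0.0
for w, x in zip(weights, inputs):
    acc = mac(acc, w, x)
print(acc)   # 0.5*1.0 + (-1.0)*3.0 + 2.0*0.25 = -2.0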

At the other extreme, some architectures employ full-featured processor cores, sometimes with very specific custom instructions in addition to a full complement of standard instructions. These architectures claim to be “fully programmable,” because the processor can do more than multiply-accumulate. That can be helpful where the same processor also handles non-neural portions of an application.

Fig. 7: An architecture that combines both fully-programmable and function-specific (shader) blocks. Source: Think Silicon

There are other architectural notions to explore in more detail. Part two of this series will take a closer look.
