Inferencing and some training are being pushed to smaller devices as AI spreads to new applications.
AI is becoming increasingly sophisticated and pervasive at the edge, pushing into new application areas and even taking on some of the algorithm training that has been done almost exclusively in large data centers using massive sets of data.
There are several key changes behind this shift. The first involves new chip architectures that are focused on processing, moving, and storing data more quickly. These are key design goals in the data center, where custom designs can accelerate data processing and movement, but they are relatively new in edge devices, particularly those running on a battery, because most previous designs focused on keeping a processor highly utilized, which is energy-intensive.
However, partitioning these designs into separate functions, and using sparser algorithms that force more of the weights to zero, significantly reduces the energy required for computation. That, in turn, opens the door for these architectures to be used across a much wider variety of applications and markets.
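The principle can be illustrated with a short sketch. The NumPy example below (illustrative only, not any vendor's implementation) applies magnitude pruning to force a fraction of the weights to zero; every weight that becomes zero is a multiply-accumulate a sparsity-aware engine can skip, along with the operand fetch that would have fed it.

```python
import numpy as np

# Illustrative sketch: magnitude pruning forces the smallest weights to zero,
# so a sparsity-aware engine can skip those multiply-accumulates entirely.

def prune_weights(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_sparse = prune_weights(w, sparsity=0.8)

# Each nonzero weight corresponds to one multiply-accumulate the hardware
# actually has to perform (and one operand it has to fetch from memory).
total_macs = w.size
remaining_macs = np.count_nonzero(w_sparse)
print(f"MACs kept: {remaining_macs}/{total_macs} "
      f"({remaining_macs / total_macs:.0%} of the dense workload)")
```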
“There is very rapid evolution happening at the edge,” said Cheng Wang, senior vice president of software and CTO of Flex Logix. “If you look at 5G, it’s moving very much toward AI neural networking. Channel estimation used to be all done by DSP processors. Now people are talking about running CNN graphs. With 5G, most of the secret sauce is in the channel estimation — estimating the channel utilization and then compensating to account for the distortion of the signal that has occurred in the channel. A lot of this has to do with how you intelligently create the most accurate channel model for the signal so you can have the best high-quality reconstruction. Now that’s moving toward AI.”
Fig. 1: Identifying objects in real-time at the edge. Source: Flex Logix
This is basically adaptive filtering based on vector matrix multiplication. “Right now, 5G has not been fully deployed because it’s finicky,” Wang said. “Most of this high-bandwidth technology needs line-of-sight and a lot of localized radio units to provide good channels and high-bandwidth communication. But now they’re going to use small, localized radio units that are supposed to be everywhere, not like the towers on the side of a freeway that are trying to serve a large radius of users.”
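To make that workload concrete, the sketch below (illustrative numbers and a heavily simplified channel, not a 5G stack) expresses classic least-squares channel estimation from known pilot symbols as a matrix problem. The same vector-matrix arithmetic is what a DSP kernel, or increasingly a small neural network, is accelerating.

```python
import numpy as np

# Illustrative sketch: least-squares channel estimation from known pilot
# symbols, written as a matrix problem (rx ≈ P @ h). Assumed setup: a short
# 4-tap multipath channel and a 64-symbol BPSK pilot sequence.

rng = np.random.default_rng(1)
n_taps = 4
h_true = rng.normal(size=n_taps) + 1j * rng.normal(size=n_taps)

pilots = rng.choice([-1.0, 1.0], size=64)          # known training symbols
rx = np.convolve(pilots, h_true)[: len(pilots)]    # what the channel delivers
rx += 0.05 * (rng.normal(size=rx.shape) + 1j * rng.normal(size=rx.shape))

# Convolution matrix: column k holds the pilot sequence delayed by k samples.
P = np.zeros((len(pilots), n_taps), dtype=complex)
for k in range(n_taps):
    P[k:, k] = pilots[: len(pilots) - k]

# One linear solve -- the vector-matrix math a DSP or small NN accelerates.
h_est, *_ = np.linalg.lstsq(P, rx, rcond=None)
print("relative estimation error:",
      np.linalg.norm(h_est - h_true) / np.linalg.norm(h_true))
```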
Data volumes vary
A second shift is a recognition that data volumes can vary significantly, depending upon the application. In fact, some edge devices can be used to train simple algorithms, rather than relying on complex training in data centers.
“We’re talking about training on simple sensor data,” said Kaushal Vora, senior director of business acceleration and global ecosystem at Renesas. “These are not typically large deep learning models. These could be mathematical models, models based on signal processing and things like that, which are still doing machine learning. Those can be easily trained with a very low power budget on the fly. For example, models tend to drift sometimes with environmental conditions, which may not have been taken into consideration while training. Simple drift correction can be done at the edge by training.”
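A minimal sketch of the kind of on-the-fly correction Vora describes is shown below, using a synthetic drifting sensor signal rather than any specific Renesas device or API. The "training" is simply a continuously updated baseline estimate, which is cheap enough to run within a small power budget.

```python
import numpy as np

class DriftCorrector:
    """Track a slowly drifting baseline with an exponentially weighted mean."""

    def __init__(self, alpha: float = 0.02):
        self.alpha = alpha          # small alpha = slow adaptation to drift
        self.baseline = 0.0
        self.initialized = False

    def update(self, sample: float) -> float:
        if not self.initialized:
            self.baseline, self.initialized = sample, True
        else:
            self.baseline += self.alpha * (sample - self.baseline)
        return sample - self.baseline   # drift-corrected reading

# Synthetic sensor: a useful periodic signal riding on slow thermal drift.
rng = np.random.default_rng(2)
t = np.arange(5000)
drift = 0.002 * t                        # baseline wander over time
signal = np.sin(2 * np.pi * t / 50)      # the pattern a small model cares about
raw = signal + drift + 0.1 * rng.normal(size=t.size)

corrector = DriftCorrector(alpha=0.02)
corrected = np.array([corrector.update(x) for x in raw])
print(f"raw mean near end:       {raw[-500:].mean():.2f}")
print(f"corrected mean near end: {corrected[-500:].mean():.2f}")
```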
Others point to more ambitious training goals at the edge. Tony Chan Carusone, CTO of Alphawave Semi, predicts vendors may create lightweight training chips or “inference-plus” chips, which would be inference chips with additional components that could perform training. “Chiplets could be the perfect enabler because you can generate a variant that has a little more memory to accommodate such a use case, without having to re-architect everything from scratch. You could mix and match existing things to make a lightweight version of the same training chips that are being used back in the hyperscale data center, perhaps with less memory or fewer compute tiles, with low energy costs.”
Nevertheless, the basic distinctions are unlikely to meld. “There’s a lot of difference in the training and the inference,” said Chan Carusone. “The jobs are so massive, it already takes months to train some of the most useful models, even on a massively parallel infrastructure. Whereas for inference, usually the emphasis is on speed or responsivity or latency, how quickly can you produce an answer.”
In most cases, though, the real value at the edge is in the inferencing. Most training is still performed in the data center, with the models then hyper-optimized to allow for more computationally efficient inference at the edge. As a result, tailored hardware is becoming increasingly important for inference.
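One common form of that hyper-optimization is post-training quantization, sketched below in generic terms (symmetric int8 quantization, not tied to any particular deployment toolchain). The quantized weights occupy a quarter of the space, which cuts both storage and the data movement the inference hardware has to sustain.

```python
import numpy as np

# Illustrative sketch: symmetric post-training quantization of float32
# weights to int8, one generic way models are slimmed down for edge inference.

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0                  # map max magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(5)
w = rng.normal(scale=0.2, size=(512, 512)).astype(np.float32)
q, scale = quantize_int8(w)

# 4x smaller weights to store and move, at the cost of a small rounding error.
err = np.abs(dequantize(q, scale) - w).max()
print(f"bytes: {w.nbytes} -> {q.nbytes}, max abs error: {err:.4f}")
```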
“We’re going to be seeing inferencing all around,” said Russell Klein, program director at Siemens EDA. “And the challenge is, do you do that inferencing on something very general purpose or do you customize it?”
The answer depends on the application. “The more customized you go, you can create a device that’s smaller, faster, and lower power,” Klein noted. “But you’re exposing yourself to the risk that a new use case might come up, or a new algorithm, and now you can’t change that out in the field. For example, we might create an object detection algorithm, with a convolution that has a 3 x 3 filter baked into the hardware, and it’s never going to be able to support a 5 x 5 filter. By contrast, a GPU will support any number of filters, but it won’t be as small and power-efficient.”
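The tradeoff Klein describes can be sketched in a few lines (purely illustrative, with made-up function names): a "hardened" routine that only ever handles 3 x 3 filters stands in for the baked-in hardware block, while a general routine that accepts any filter size stands in for the programmable path.

```python
import numpy as np

def conv2d_fixed_3x3(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stand-in for a baked-in hardware block: rejects anything but 3x3."""
    assert kernel.shape == (3, 3), "hardware only supports 3x3 filters"
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

def conv2d_general(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stand-in for the programmable path: handles 3x3, 5x5, or anything else."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.random.default_rng(3).random((32, 32))
k3 = np.ones((3, 3)) / 9.0
k5 = np.ones((5, 5)) / 25.0

conv2d_fixed_3x3(img, k3)        # fine: matches what was hardened
conv2d_general(img, k5)          # fine: flexibility costs area and power
# conv2d_fixed_3x3(img, k5)      # would fail: the fixed block cannot adapt
```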
And this is where the definition of the edge can get confusing. The edge spans from the endpoint all the way to the cloud, and the number of possible applications and permutations is enormous. It includes everything from networking infrastructure to devices that run on a small, rechargeable battery.
“In the networking world, it’s all about moving large amounts of data as fast as you can through a pipeline with an absolute minimum amount of latency, a minimum amount of dark silicon, at a maximum utilization. That describes both networking and what AI inference needs to be on the edge,” said Paul Karazuba, vice president of marketing at Expedera. “It needs to be able to move large amounts of data through a comparatively small pipeline extremely fast, at an extremely high utilization.”
At the same time, edge implementations need to overcome form-factor limitations and environmental fluctuations that aren’t even considerations for warehouse-sized, climate-controlled data centers. “At the extreme edge, your biggest constraint and challenge is to do useful computations while sipping the least picojoules of energy from a lightweight battery,” said Chan Carusone. “Another big constraint is cost, because these are often consumer devices.”
These devices also need to be able to address generative AI as it moves to the edge, where the growing appetite for transformer models versus older convolutional neural networks (CNNs) is creating future-proofing challenges for designers.
“If your architecture is not capable of moving seamlessly between the two, then it’s no longer going to be successful,” said Arun Iyengar, CEO of Untether AI. “CNNs and DNNs are compute-bound networks, in which TOPS matter. By contrast, generative AI is memory-bound. The actual application that needs to happen pales in comparison to the amount of data that you need to bring in through your memory interfaces. Your architecture looks very different. If you were to just focus on CNNs, then you’d find that you can’t stream the data because you don’t have enough high-bandwidth links to memory, which means there’s no way you can do anything meaningful on a large language model.”
Transformers are a midway point between the two, and include compute-heavy elements such as attention heads, which are a key part of a neural network’s “attention mechanism” that directs the network’s focus and helps it derive patterns. “Transformers during training can learn to pay attention to other pixels,” said Gordon Cooper, product marketing manager for embedded vision processors at Synopsys. “Attention networks have greater ability to learn and express more complex relationships.”
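For reference, a single attention head reduces to a few matrix operations. The sketch below (a generic scaled dot-product formulation, not any specific product’s kernel) shows how every position forms a query against every other position’s key, which is what lets the output at one token or pixel draw on context from all the others.

```python
import numpy as np

# Illustrative sketch: one scaled dot-product attention head in NumPy.

def attention_head(x: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    q, k, v = x @ wq, x @ wk, x @ wv                  # project inputs
    scores = q @ k.T / np.sqrt(k.shape[-1])           # all-pairs similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ v                                # weighted mix of values

rng = np.random.default_rng(4)
seq_len, d_model, d_head = 16, 32, 8
x = rng.normal(size=(seq_len, d_model))
wq, wk, wv = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))

out = attention_head(x, wq, wk, wv)
print(out.shape)   # (16, 8): each position now reflects context from all 16
```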
Transformer attention heads carry heavy computational throughput requirements, but they also pull in a lot of data, which pushes on the memory side of the equation. “The bookends are CNNs on the compute side and large language models on the memory-bound applications. Transformers, like the NLPs, fall in between. You have to work through that spectrum and figure out what choices you’re going to make,” Iyengar said. “For example, when creating a CNN engine, your goal is to have memory very close. You’re streaming a lot of activity coefficients, and have all the weights resident, so you can run through a lot of these pixels as quickly as you can. The goals change when creating an LLM. You’re focused on mega highway lanes to do memory, which is also very fast and very deep. Bandwidth and the density become necessary, so you can pull data from outside as quickly as you can. For a simple comparison, if you did a recording on a CNN-class engine, then what you talked about in the last 20 seconds would be all that’s available. But if you did a truly large language model with the higher bandwidth and higher density for memory, which is what a memory-bound chip would do, then you could review the entire conversation.”
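A back-of-the-envelope comparison (with illustrative, assumed numbers) shows why the two bookends behave so differently: a convolution layer reuses a small set of weights at every output pixel, while each LLM decode step reads essentially every weight once to produce a single token.

```python
# Illustrative arithmetic only -- not any vendor's specifications.
# Operations performed per byte of weights moved from memory.

# CNN layer: a small weight tensor is reused at every output pixel.
k, cin, cout, h, w = 3, 128, 128, 56, 56
conv_weight_bytes = k * k * cin * cout * 1             # int8 weights
conv_macs = k * k * cin * cout * h * w                 # reused h*w times
print(f"conv layer: {conv_macs / conv_weight_bytes:,.0f} MACs per weight byte")

# LLM decode step: every weight is read to produce one token.
params = 7e9                                            # assumed 7B-parameter model
llm_weight_bytes = params * 1                           # int8 weights
llm_macs = params                                       # ~1 MAC per weight per token
print(f"LLM decode: {llm_macs / llm_weight_bytes:,.0f} MAC per weight byte")
```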
The cost of moving data
The third big shift falls under the heading of the cost of moving data, which is the main reason the edge is exploding. Moving data costs resources, energy, and time, and it raises the risk of data leakage or theft, as well as privacy issues. So what can be done locally on a device is important.
Chan Carusone believes reinforcement learning can play a role at the edge along with other approaches, especially in situations involving proprietary data with increased privacy and security concerns. “There’s going to be more demand for that, because not everyone’s going to make their own massive generative model. Instead, they may get licenses and build their own refinements to that model tailored to their proprietary data and their use cases. There’ll be massive models constantly being improved and retrained in big hyperscale centers, with refinements running in other places, on data that companies don’t want to risk sharing. As that gains steam, you’re more likely to see more efficient methods at the edge.”
Still, models running at the edge can actually require more intensive training. “To do a large language model, training is important and expensive,” Iyengar said. “But to do a small model, training is even more important, because you have to narrow down the parameters that are absolutely necessary. Whereas with a large language model, if you’ve got 1.8 trillion parameters, it’s okay if you miss a few. Your deployment, which is an inference, is going to happen in a really small box. It needs to be very compact and able to do what’s necessary, so training becomes a little more challenging for the edge application than it is for the traditional large language type models that are within the data center.”
The future
While the rise of edge AI is a significant trend, there always will be use cases that require the power and data consolidation of a data center.
“If you’re looking for patterns, you have to do it collectively,” noted Steve Roddy, chief marketing officer at Quadric. “Take so-called ‘porch pirates’ for example. If I’ve got 10,000 cameras reporting in, I can figure out where thieves are going door-to-door stealing packages and alert the police to where they’re likely going to be next. Ultimately, it’s going to be a co-dependency. You’re going to be able to process signals where they occur and you’re going to be able to analyze patterns in the cloud.”
The challenge will be in figuring out what works best where, because over-specificity can cause issues, as well. “One tradeoff to consider is that you don’t typically get a good view of what’s happening all the way across everything in the environment,” said Nalin Balan, business development manager at Renesas. “If you learn at the edge from one specific device, can it generalize across all other endpoints or devices?”
In the end, everyone is both optimistic and a bit awed by how quickly this field is moving. “If there’s a large language model that can’t be done at the edge today, I can guarantee that eventually there’s going to be a smaller version of that LLM that can be inferenced as far out at the edge as possible,” said Untether AI’s Iyengar. “I’m a big believer that the inference area has no limit, and we’ll have a way around limitations either at the model level or at the chip architecture level.”
Related Reading
AI Accelerator Architectures Poised For Big Changes
Design teams are racing to boost speed and energy efficiency of AI as it begins shifting toward the edge.
A Packet-Based Architecture For Edge AI Inference
Increase NPU utilization by optimizing the flow of activations through a network.
AI Transformer Models Enable Machine Vision Object Detection
A system-level view of machine vision will be essential to move the technology forward.