Architecting solutions for edge AI is not about scaling down cloud solutions or making small extensions to existing MCUs/MPUs. It’s a hardware/software/model co-development problem.
Key takeaways
Implementing AI on the edge is driven by a different set of metrics than training or even inference in the cloud. It makes power a first-class citizen, if not the most important design consideration, and it cannot be done successfully without co-design.
Almost all the attention associated with AI goes to training in large data centers, where power is important because of the sheer amount being consumed. It demands the very latest in cooling and packaging technology, and has driven most of the research and development within the semiconductor and EDA industries. But on the edge, where inference is turning those trained models into devices that add value, things are very different.
Power is still important, but now every milliwatt counts, especially when the device is powered by a battery. Size matters, cost matters, and complex tradeoffs are often necessary. While edge AI has not received the same amount of attention, this is where much of the money will be made eventually, and it is a vital part of the long-term health of the technology.
Edge AI is seeing a growing number of applications. “Adoption is prevalent in areas such as industrial automation,” says Kavita Char, principal product marketing manager at Renesas. “Applications include motor control, where you would see use cases like predictive maintenance, and anomaly detection on a factory floor, and building automation, home automation, and smart homes. We are seeing a lot of consumer products, such as smart thermostats, video doorbells with intelligence built in. We are also seeing this in wearables. Then there are new segments like smart cities, which include systems for traffic management and pedestrian crossings.”
The level of demand for AI is clearly different for each type of device. “Higher performance implementations of edge AI are typically very focused towards the specific application and use case of the device,” says Paul Karazuba, vice president of marketing at Expedera [at the time of this interview]. “That would be something like a car or a smartphone. The less performant applications are typically much more generic. That would be something like a smart refrigerator, or a general consumer device. Certainly, there are going to be exceptions to that rule, but in general, the more performant, the more application-focused the particular design is.”
In the cloud, power is scalable. “You can add more memory, add hardware and cooling, and everything that is required to implement these AI systems,” adds Renesas’ Char. “But on the edge, there is a physical limit to what can be accomplished, and that depends on the specific device that’s going to power this AI application. Some key areas of concern are that many of these edge devices are very power-sensitive. These are battery-operated, and they have limited power budgets, so the challenge is running complex AI models.”
Those power restrictions spread to other concerns. “Edge devices operate under tight energy and thermal limits, which make more efficient power conversion a critical requirement,” says John Demiray, senior product marketing manager for Microchip Technology’s Analog Power and Interface Division. “Further, restricted board space requires form factors that push power density higher. Another key design implication is cost-effectiveness for volume deployment. High volume edge designs need to balance BOM and assembly cost against efficiency and integration.”
Higher power density means more thermal problems. “Thermal dissipation becomes a big deal for them, because they are not able to put big heat sinks on these chips and devices,” says Suhail Saif, director of product management and solutions engineering at Keysight EDA. “In general, they are fanless environments. These constraints push the design teams in different directions and require them to make difficult architectural choices.”
All aspects of energy become important design considerations. “It goes well beyond ensuring that the entire system works to minimize energy per inference,” says Qazi Faheem Ahmed, principal product manager at Siemens EDA. “That impacts battery life, and thermals depend on it. With little cooling and tight form factors, managing hotspots becomes just as important as cutting total power, especially as workloads jump from idle to bursts alongside radios and sensors.”
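The link between energy per inference and battery life can be made concrete with a back-of-the-envelope estimate. The following is a minimal sketch, not a model of any vendor's device: the function name, its parameters, and the simplification that idle power is a constant floor are all assumptions made for illustration.

```python
def battery_life_hours(battery_mwh, energy_per_inference_mj,
                       inferences_per_hour, idle_power_mw):
    """Rough battery-life estimate for a device that wakes up to run inference.

    Assumption: idle power is treated as an always-on floor, and inference
    energy is added on top of it. Real devices have more power states.
    """
    # 1 mWh = 3.6 J = 3600 mJ, so this converts mJ spent per hour into mWh per hour.
    inference_mwh_per_hour = energy_per_inference_mj * inferences_per_hour / 3600.0
    # A constant draw of X mW for one hour consumes X mWh.
    total_mwh_per_hour = inference_mwh_per_hour + idle_power_mw
    return battery_mwh / total_mwh_per_hour
```

With made-up numbers (a 1,000 mWh battery, one inference per second at 5 mJ each, and a 0.5 mW idle floor) the estimate is roughly 182 hours. Halving the energy per inference to 2.5 mJ raises it to roughly 333 hours, which is why energy per inference shows up directly in battery life.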
Design considerations
It is not possible to simply migrate a model from training to inference or from cloud to edge. “Inference is, algorithmically, a subset of training, but in terms of the deployment constraint, it’s vastly different,” says Sharad Chole, chief scientist at Expedera. “Training is, for the most part, done in the cloud, and the models are distributed across multiple GPUs. It’s data parallel training, and there is no latency sensitivity. Things are more complicated on the training side. You need to do backward pass, forward pass. You need to gradient update, and this requires a bunch of operations that are not just part of the network, but also derived from the network, or derived from the learning algorithms or learning rate.”
“On the inference side, the first and most important thing is that unless it’s a cloud inference or server inference, we are not going to get any batch opportunities,” continues Chole. “It basically has a batch size of one. And that’s a huge differentiating factor, which means we cannot use batch dimension on the inference side to get higher utilization. You have to come up with architectures which are utilized natively, not by doing data parallelism, but by exploiting the network architecture itself.”
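Why batch size matters for utilization can be seen by counting how often each fetched weight is reused. The sketch below assumes a single fully connected layer and ignores activations and caching. The function name and parameters are hypothetical.

```python
def macs_per_weight_byte(rows, cols, batch, bytes_per_weight=1):
    """Multiply-accumulates performed per byte of weights fetched
    for one fully connected layer of shape rows x cols."""
    # Each weight is used exactly once for every input in the batch.
    macs = rows * cols * batch
    weight_bytes = rows * cols * bytes_per_weight
    return macs / weight_bytes
```

Under these assumptions, a batch of 32 gets 32 MACs out of every weight byte, while a batch of one gets a single MAC. With no batch dimension to lean on, reuse has to come from the structure of the network itself, which is the point Chole makes.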
This is an area where everyone is still learning. “Today, we are at the launch pad of inference,” says Kamran Khan, head of business development at AWS’ Annapurna Labs. “We are just starting to mature these applications and maturing the layer of AI as a resource that all application services will be leapfrogging off in the future. DeepSeek was one of the first to show how to do it at scale, and by utilizing new reinforcement learning techniques and distillation techniques, that fed into them having better sparse models that were more cost-effective to serve. We are seeing this feedback loop going, feeding each other and creating better models.”
Even if you remove the functions needed for training, you still do not have an adequate starting point for inference. “Designing for edge AI is not about shrinking a data-center solution,” says Guillaume Boillet, vice president of strategic marketing at Arteris. “Instead, it is about co-optimizing compute, memory hierarchy, and interconnect from the outset to extract maximum intelligence from a tightly constrained energy budget.”
Not every company is up for that challenge. “Your choices are to change your architecture, optimize an existing architecture, or maybe lower the performance and throughput in order to meet your power budget,” says Keysight’s Saif. “The smart teams do not start with a given architecture. They start with a power budget. Many companies have an architecture from a previous generation, and that means they can get started immediately. But when you start with a power budget, you have to start from scratch. Some companies are ready to compromise on power using a scavenger hunt, trying to find bits and pieces from the previous architecture that they can scale.”
There are several aspects of architecture that have to be considered. “Architecturally, that means reducing data movement, using efficient AI accelerators, supporting sparsity and low precision math, and relying on heterogeneous SoCs that can sprint and quickly return to sleep, vital for smart cameras and wearables,” says Siemens’ Ahmed. “Efficient memory hierarchies and tuned NoCs help handle burst events like motion triggered inference, while power delivery, packaging, and thermal spreading play a critical role in compact devices such as AR/VR headsets. Firmware completes the picture with batching and aggressive low power modes, much like smart speakers that stay in ultra low power listening states until needed.”
One of the most important aspects is memory. “More power is consumed by moving data in and out of the device with external memory than the actual compute itself,” says Char. “The actual compute is important, and does consume power, but it’s more about firing up those pins to talk to external memory. That’s consuming more of the power. We do many things, and memory management becomes one of the focus areas for us to look at when we try to design these AI devices.”
The implications of that are enormous. “It is more power efficient to recompute multiple bytes of data than loading just one byte of data from memory,” says Saif. “Data movement is coming into focus within the software world. It’s about awareness. I am beginning to see the industry wake up to this reality. It is bringing a mindset change to the designers, to software designers as well as hardware designers. They now have to be mindful of everything they are designing from a power point of view.”
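The recompute-versus-load tradeoff comes down to a simple comparison once each operation has an energy cost attached. The sketch below uses placeholder costs that are not measurements of any real device. The table, its keys, and the function name are all hypothetical.

```python
# Placeholder energy costs in picojoules. These are illustrative values only,
# chosen to show the shape of the comparison, not data from any real chip.
ENERGY_PJ = {"alu_op": 1.0, "sram_byte": 5.0, "dram_byte": 100.0}


def cheaper_to_recompute(ops_to_recompute, bytes_to_load,
                         source="dram_byte", energy=ENERGY_PJ):
    """Return True if recomputing a value costs less energy than loading it."""
    recompute_pj = ops_to_recompute * energy["alu_op"]
    load_pj = bytes_to_load * energy[source]
    return recompute_pj < load_pj
```

With these placeholder costs, spending 20 ALU operations to regenerate one byte beats fetching it from external memory, but not fetching it from on-chip SRAM. The real ratios depend on the process, the memory type, and the interface, so they would have to be measured or modeled for each design.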
When multiple processors are involved, the entire memory hierarchy has to be considered. “In this context, the network-on-chip becomes a primary determinant of efficiency rather than just an integration fabric,” says Arteris’ Boillet. “Topology, bandwidth provisioning, arbitration, and quality-of-service mechanisms must align with real traffic patterns. Overprovisioning wastes dynamic power, while under-provisioning forces higher margins and risks latency violations.”
That extends out to model considerations as well. “A lot of the designs are becoming more SRAM-centric, meaning we try to pack as much memory on-chip as possible to reduce the movement of data between the device and external memory,” says Char. “If the model fits into the on-chip memory, that is the lowest power configuration that a customer can use. The moment you go to external memory, that’s where you start seeing a spike in power consumption. Users have to be very careful about this. It requires optimization from the user side, as well. They have to look at what models they’re using and if they want to use weight compressions, which basically reduces the size of the model so that they can fit into the chip itself, rather than have to go into external memory.”
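Whether a model fits on-chip is largely arithmetic: parameter count times bits per weight, compared against the SRAM that is left after activations and buffers. A minimal sketch follows. The function names and the idea of a single reserved-bytes figure are simplifying assumptions.

```python
def model_bytes(num_params, bits_per_weight):
    """Storage needed for the weights, rounded up to whole bytes."""
    return (num_params * bits_per_weight + 7) // 8


def fits_on_chip(num_params, bits_per_weight, sram_bytes, reserved_bytes=0):
    """True if the weights plus a reserved region (activations, buffers)
    fit within the given on-chip SRAM capacity."""
    return model_bytes(num_params, bits_per_weight) + reserved_bytes <= sram_bytes
```

For a hypothetical one-million-parameter model on a part with 2 MiB of SRAM and 512 KiB reserved for activations, 32-bit weights (4,000,000 bytes) do not fit, while 8-bit weights (1,000,000 bytes) do. That is the kind of check that decides whether external memory is needed at all.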
Many of the hardware development techniques used in the past no longer work. “When it comes to power savings, nuance matters,” says Ahmed. “Fine-grain clock gating inside dense MAC arrays often costs more than it saves, but coarse tile-level gating or operand isolation can meaningfully cut switching activity. And because edge workloads tend to be bursty, the interconnect must handle spikes gracefully.”
There are some applications where a larger amount of processing power is needed, and cost pressures are not quite as intense. “In autonomy-driven applications like cars or robots, there is a push towards higher TOPS, primarily because of how the workloads have been changing,” says Chole. “We are looking into more transformer-based and convolutional fusion architectures, where end-to-end autonomy and decisions can be made. That requires almost a peta-ops worth of computing power on the edge. The only way to practically build it is by using chiplets. Even if we don’t have device-to-device communication, we still have to worry about chiplet-to-chiplet communications. And within the chiplet you might have a multiple-core NPU to drive the total capacity of operations higher, and that communication also becomes important.”
Cost and performance still have to be balanced. “When you’re talking about the edge, you’re talking about small devices,” says Jeff Tharp, product manager at Synopsys. “It starts becoming about cost for the given bandwidth, and because of that you’re starting to see a lot of chiplet designs. You can now have a system that is broken up into chiplets and you get yield improvement. This is another side of the conversation beyond memory and bandwidth — cost reduction to enable more commodity products in the future.”
Co-development of hardware, software, and model
The model is one piece of the system that can be co-developed to fit the available resources. “There are customers that can control their model,” says Chole. “They have applications and in-house deployments of those applications. These developments are easier to manage because the algorithms that are being developed are still in-house. The architects have to define the subset of an operation that needs to be supported, or the precision that needs to be supported. On the other hand, there are customers who cannot guarantee the models that are being deployed on the SoCs. These models will evolve, and you do not have 100% control over how these models will evolve. Even if you control the model, you still don’t have control over how they will evolve. That makes it more challenging.”
Hardware-software co-design is becoming essential. “Hardware-software co-design is easier said than actually done,” says Saif. “An industry that goes with open source, where everything can be co-developed by multiple industry partners, may realize later that not having complete co-design from the beginning hurts later when everyone is pushing the limits of power and performance. Hardware efficiency is certainly more in focus, where the hardware designers are pushed to optimize their hardware design so that they’re running software more efficiently. But in the software world, there is more urgency to optimize the hardware calls that they’re making. They are attaching power costs to every hardware call. Some are much costlier because they need a lot of data to be recalled from memory or more toggles through the hardware, and they’re able to attach an approximate cost to every hardware call in the software.”
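Attaching an approximate cost to every hardware call can be as simple as a lookup table and a running tally. The sketch below is one hypothetical way to do that bookkeeping in software. The call names, the cost figures, and the `EnergyLedger` class are invented for illustration and do not come from any real SDK.

```python
from collections import Counter

# Hypothetical per-call energy estimates in microjoules (not measured values).
CALL_COST_UJ = {"npu_conv": 12.0, "dma_from_external": 40.0, "sram_read": 1.5}


class EnergyLedger:
    """Tallies hardware calls and reports their estimated energy."""

    def __init__(self, costs):
        self.costs = costs
        self.calls = Counter()

    def record(self, call_name, count=1):
        # Refuse calls without an estimate so gaps in the table are visible.
        if call_name not in self.costs:
            raise KeyError(f"no cost estimate for {call_name!r}")
        self.calls[call_name] += count

    def total_uj(self):
        return sum(self.costs[name] * n for name, n in self.calls.items())

    def most_expensive(self):
        # Name of the call type with the largest total cost, or None if empty.
        return max(self.calls,
                   key=lambda name: self.costs[name] * self.calls[name],
                   default=None)
```

In this toy example, four external DMA transfers cost more than ten convolutions, which mirrors Saif's observation that calls pulling data from memory tend to dominate.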
There are software and tools that need to be co-designed, as well. “When you are thinking about hardware, you provide ways to enable power reduction options,” says Char. “For a designer of a chip, they’re looking at both sides, and there has to be this hand-in-glove discussion and decisions being made on the design, based on the software ramifications as well as hardware ones. We provide tools for our software developers in order for them to more efficiently use some of the capabilities of the hardware. A very key part of the solution is, in addition to the hardware, the software solution.”
“Shift left” is not just a marketing term. “For edge AI, you really do have to start at the architectural stage with realistic workloads, power models, and interconnect behavior,” says Ahmed. “Then you can apply the right power-aware techniques consistently all the way through RTL, physical design, and validation. If those decisions are made early and reinforced throughout the design cycle, you end up with hardware that’s efficient and still has enough headroom to support whatever the next generation of models demands.”
Shift left extends to the model, as well. “Inference will require us to optimize the models in a significant way while maintaining the accuracy,” says Chole. “The techniques here are quantization, distillation, quantization-aware training, and reduced low-precision mapping of some of the operators. You might want to do things that reduce the bandwidth, like activation management and weight compression. And finally, how do you build an architecture that is power-friendly? You need something that can scale to different applications, but also something that is good at giving you the best TOPS per watt.”
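Of the techniques Chole lists, quantization is the easiest to show in a few lines. The sketch below implements one common variant, symmetric per-tensor quantization to 8-bit integers. It is an illustration of the general idea, not the scheme used by any company quoted here, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the int8 codes."""
    return [c * scale for c in codes]
```

Each weight shrinks from 32 bits to 8, at the price of a rounding error of at most half a quantization step per value. Whether accuracy survives that error is what has to be checked model by model, and it is the reason quantization-aware training exists.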
Knowing the application is a good starting point. “Instead of optimizing the design for some theoretical corner, or a scenario you synthetically created, you can use a real-life workload for optimization,” says Saif. “Customers see this as a ‘must’ step for edge computing. Hardware/software co-design is very important for edge devices, and we need to partner with the other side of the hardware-software bridge so that our design decisions are in sync. That way, the overall system can benefit from power savings more than before.”
The more you control, the more you can optimize. “It all starts with memory management,” says Chole. “If you want to do the best possible memory management, you have to completely own the workload. When we design a compiler, the compiler has full end-to-end visibility into the neural network that is being executed. This is not how GPUs work. They operate one layer at a time. But if you do that, you miss out on global optimizations. You miss out on data reuse and data sharing. You miss a lot of reordering optimizations.”
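The benefit of whole-network visibility can be illustrated by counting external memory traffic for a chain of layers. This is a toy model under strong assumptions: weights are ignored, fused execution is assumed to keep every intermediate activation on-chip, and layer-at-a-time execution is assumed to write every activation out and read it back. It does not describe any specific compiler or GPU, and the function name is hypothetical.

```python
def external_traffic_bytes(activation_sizes, fused):
    """Bytes moved to and from external memory for a chain of layers.

    activation_sizes[0] is the network input, activation_sizes[-1] is the
    final output, and the entries in between are intermediate activations.
    """
    if fused:
        # Only the input is read and the final output is written externally.
        return activation_sizes[0] + activation_sizes[-1]
    # Layer at a time: every layer reads its input and writes its output externally.
    return sum(activation_sizes[i] + activation_sizes[i + 1]
               for i in range(len(activation_sizes) - 1))
```

For activation sizes of 100, 400, 400, and 10 bytes, the layer-at-a-time schedule moves 1,710 bytes while the fused schedule moves 110. Real gains are smaller because intermediates do not always fit on-chip, but the gap shows what is lost without a global view.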
Future-proofing
Even if you control all of the ecosystem, you still cannot fully know the future. Many edge devices have extended lifetimes, and upgrades are a necessary part of the ecosystem. That means you need to provide for changes that are out of your control.
“One of the biggest challenges we have is to look into a crystal ball and say, three or five years from now, this is what the AI models are going to look like,” says Char. “With the current pace of AI development, with AI models changing every few weeks, this is a very hard problem for us. Trying to predict is difficult, but maybe there are things that we can do in our devices to provide some future-proofing.”
But how much headroom needs to be built in? “It’s not about guessing the future model,” says Saif. “It’s about preserving optionality under a fixed power budget and fixed thermal envelope. You can’t go berserk adding lots of things under the name of future-proofing. Future-proofing is a system design problem. It’s not a silicon problem. Is your system capable of functioning within the ecosystems later on?”
It is easy to overdo it. “All devices are in some sense future-proof,” says Chole. “The question is, what performance level do you get for the new operations that you need to do? That changes or limits the algorithm designers, or the network designers, the neural network designers, to choose the right set of operations. When you’re taping out the SoC, you have to allocate the budget. You have to consider how much power you’re going to burn if you are moving to higher precision. These two things prohibit applications from using higher-precision operations, or operations that require external processing units. We have to limit the amount of power it can consume, or we have to limit the amount of area you can consume, how much bandwidth. And that budgeting forces us to specialize.”
There are various ways to do this, some of which do not have to impact performance today. “Since models and workloads evolve quickly, flexibility is critical,” says Microchip’s Demiray. “Platforms should support configurable rails, telemetry, and firmware updates so the same hardware can accommodate new models or kernels without a re-spin. The ability to increase power in phases enables modular, multi-phase regulators and scalable power trees. This capability allows system architects to step up load current and improve transient response across product refreshes as computation requirements grow, thereby avoiding substantial redesigns.”
Conclusion
AI is slowly being adopted on the edge, and as that happens, developers are realizing it is not just about creating good hardware architectures. Instead, it is a hardware/software/model co-development problem. This requires new tools, new methodologies, new organizational structures, and a new mindset. While it is possible to incrementally change existing products, and doing so may offer a time-to-market advantage, those products will not be able to compete in the long term.