Complex Mix Of Processors At The Edge

Diversity of compute elements proliferates for inference, but the mix varies by application.


With AI changing so fast, companies are juggling two goals: delivering the best performance now while future-proofing for unknown AI models, or for a completely different approach to training and inference that may emerge. There is a slew of options for high-end and budget phones, hyperscalers, and low-cost, low-power edge devices, and while GPUs keep making headlines, many designers are using ASICs, NPUs, MPUs, and FPGAs.

“We see a very logical split between the cloud space and the direction of edge devices,” said Kristof Beets, vice president of technology insights at Imagination Technologies. “Edge is anything that isn’t the cloud or data center, but there’s a variety. For example, a mobile phone, which is very small and has a battery, versus a car, which is an edge device but has completely different thermal and power characteristics.”

While AI training mostly happens in the cloud, a large amount of inference happens there, as well. “There are a lot of very complex, data-heavy networks that have very high-speed requirements,” Beets said. “That mostly tends to link to the amount of resources that you can throw at it, whereas if you look at edge usage cases, they tend to be predominantly inference, and they’re more driven by privacy. How do I keep my data, my information, local on the device? Reliability is really about the network connection. Even though everything keeps getting better, it’s not always there. And the responsiveness, and equally the efficiency on those devices, [is also important].”

Inference workloads benefit from different chip architectures, depending on the environment, said Amol Borkar, director of product management and marketing for Tensilica DSPs in the Silicon Solutions Group at Cadence, which recently introduced an AI co-processor (AICP) designed to work alongside an NPU (neural processing unit). “Each implementation serves a distinct role, with tradeoffs between power, performance, flexibility, and cost depending on the use case,” he said.

Borkar pointed to a number of AI inference options, including:

  • GPUs: Highly versatile and powerful, these are the processors of choice for data centers, where scalability and flexibility are key. However, their high power consumption limits their use in mobile devices.
  • NPUs: These are optimized for AI tasks with low power and low latency, making them well-suited for mobile and edge devices. They offer a good balance of performance and efficiency, but they are less flexible than GPUs.
  • DSPs: These usually sit between a GPU and an NPU, but closer to an NPU. A DSP provides better power efficiency than a GPU for AI and other workloads with a smaller footprint. It also provides an NPU fallback and offload mechanism, acting as an AI co-processor in many cases.
  • ASICs: These deliver maximum efficiency and performance for specific inference tasks. They excel in both mobile devices (e.g., face unlock, voice recognition) and data centers (e.g., search, recommendation systems), but their lack of flexibility and high development cost make them best suited for large-scale, targeted deployments. This type of hardened silicon is great for targeted workloads and can maximize ROI. But if the workload changes and can no longer execute on the chip, you may be stuck with a very expensive piece of silicon.

CPUs offer the ultimate in flexibility. They are extremely programmable and very adaptable, but they are not parallel processing engines. A big advantage of CPUs, however, is their ability to run any C code. “It will be very slow, but it will work,” Beets said. “This is why much of the time, the CPU is a useful fallback engine, because they can always compile the code, but GPUs are the much more effective solution there.”
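The fallback idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the engine names, the operator lists, and the preference order (NPU, then DSP, then GPU, then CPU) are hypothetical and do not come from any vendor's runtime. Each operator is placed on the first engine that supports it, and the CPU accepts anything.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Engine:
    name: str
    # Operators this engine can execute; None means "runs anything".
    supported_ops: Optional[FrozenSet[str]]

    def can_run(self, op: str) -> bool:
        return self.supported_ops is None or op in self.supported_ops


# Hypothetical engines and operator sets, for illustration only.
NPU = Engine("npu", frozenset({"conv2d", "matmul", "relu"}))
DSP = Engine("dsp", frozenset({"conv2d", "matmul", "relu", "fft"}))
GPU = Engine("gpu", frozenset({"conv2d", "matmul", "relu", "fft", "softmax"}))
CPU = Engine("cpu", None)  # the fallback: slow, but it always works

# Assumed preference order: most specialized first, most general last.
PREFERENCE: List[Engine] = [NPU, DSP, GPU, CPU]


def place(ops: List[str]) -> Dict[str, str]:
    """Map each operator to the first engine in PREFERENCE that supports it."""
    plan: Dict[str, str] = {}
    for op in ops:
        # The CPU supports every operator, so a match is always found.
        plan[op] = next(e.name for e in PREFERENCE if e.can_run(op))
    return plan


if __name__ == "__main__":
    print(place(["conv2d", "fft", "softmax", "custom_op"]))
```

In this sketch an operator that no accelerator recognizes, such as the made-up `custom_op`, still gets a home on the CPU, which is the property Beets describes.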


Fig. 1: Comparing ability to process AI. Source: Imagination Technologies

Whichever solution is chosen, it still needs to be validated and verified. “With edge AI and mobile development, validation of the system is very important at the silicon level,” said Anup Shah, director of product management at Siemens Digital Industries Software. “Using a hardware-assisted verification platform, we can model digital twins with a very high degree of fidelity and accuracy. You can validate the whole system environment, where you can test any AI accelerator along with its applications.”

ASICs vs. GPUs
ASICs are a more cost-effective and powerful alternative for specific inference tasks, often offering better performance and lower power consumption than GPUs for those functions. Still, due to their cost and inflexibility, the application-specific approach is more suited to the biggest phone or systems companies. Examples of ASICs include NVIDIA’s data processing units (DPUs), Google’s tensor processing units (TPUs), and AWS’ Trainium chips for AI training.

“The big system houses say, ‘I do want my own chip. I do want my own silicon,” said Marc Swinnen, director of product marketing at Ansys, now part of Synopsys. “It’s become so important, so central to everything my system does, that I can only compete if I make my own silicon to do exactly what I want, exactly how I want it, exactly the power, and optimized for the software I want to run on it.’ That has become what we now call bespoke silicon. You can say, ‘Isn’t that just ASIC by another name?’ I guess so. But in people’s minds, it is distinguished from the traditional low-cost ASIC. This is because the people doing this are high-end — NVIDIA, Microsoft, Amazon, Facebook, Google. These are all systems houses. They were software houses. They had nothing to do with making chips. But now they’ve all gotten into making chips because it becomes so important for them, and so they want their bespoke silicon. And these are all application-specific integrated circuits.”

Imagination’s Beets is seeing custom AI accelerators in two situations. “One is a truly differentiated solution. Take Google as an example. They’re making massive investments in AI, coming up with new innovative techniques, and combining that with their own hardware. If you are truly fully owning the algorithms and you can see them coming, you can design custom hardware for it. That works very well, but it’s very expensive.”

The problem with high-end custom chips, particularly in mobile, is that the OEM cannot maintain all the software ownership. “You can develop some things and some algorithms, but ultimately you need an ecosystem,” Beets said. “You’re going to need to enable that wide developer community to deliver. When they look at the wider ecosystem, they will discover, ‘How do I expose this to a developer who’s trying to write an application?’ That’s where these things fall apart. It works for their algorithm that they can anticipate and see coming.”

In addition, a high-end ASIC may not be able to adapt to new AI models and use cases as easily. “Flexibility comes in many forms, but the most important bit you get from a general-purpose GPU is that it can run things even if the architecture of, let’s say, a language model changes,” said Vitali Liouti, senior director of segment strategy, product management at Imagination.

This is a salient point given that various use cases for AI, the models, and their variants are changing faster than the hardware.

“We see in the past two years how AI has evolved, become more capable, more accessible, because the models are cheaper in terms of resources: less power, less area,” said Hezi Saar, executive director of product management for mobile, automotive, and consumer IP at Synopsys, and chair of the MIPI Alliance. “But we really don’t know where it’s going to go. We think it’s linear, because that’s how human brains think, but it could jump or do something else. So the architecture chosen needs to have enough flexibility to accommodate unknowns. If AI becomes exponentially more capable, I need my chip architecture or stack of dies to be able to go there. And that puts a lot of pressure on decision makers as to how they are going to go about it.”

Imagination’s Liouti agrees. “AI is not a stable workload, and the biggest progress that we see being made is on the algorithm side. The hardware moves relatively slowly in comparison to the algorithm side, and that really pushes the need for flexibility and adaptability of the hardware, as opposed to purely fixed-function engines. Predominantly, AI continues to be a parallel compute problem. There is massive compute, but from a structural view, it’s parallel computing, not sequential or a very branchy style of compute. So it’s very well suited to parallel compute engines.”

Shifting roles for AI and DSPs
In mobile, AI also is creeping into very specific processing domains, such as camera interfaces.

“The classic camera interfaces were very rigid pipelines,” Beets said. “They implemented a fixed algorithm. What we see now is that some of those traditional processing blocks, like a de-noiser, are being replaced by an AI-guided version of a de-noiser. To some degree what they are really doing is designing a new fixed-function block which happens to be created using AI. They’ve redesigned an algorithm using AI technologies and hardened it into that pipeline. You could declare a lot of those things are domain-specific AI implementations. They’re still rigid in the end, but you could tune them between generations.”

DSPs also may increasingly be replaced by an AI-specific processor, or simply replaced by an AI algorithm, said Steve Tateosian, senior vice president of IoT, consumer and industrial MCUs at Infineon Technologies. “Some things are getting done better because AI is taking over those applications or that code base, moving away from a traditional DSP approach. This means measurements will get more accurate, and it can do more sophisticated things, such as tell a push-up from an overhead lift by detecting movement. It’s very difficult to do that with a DSP. It gets a little more capable with AI.”

One area seeing a big shift is audio, from keyword spotting and voice recognition, to beam-forming, noise suppression, and echo cancellation. “We are able to use AI to do that in a more efficient and effective way than traditional DSPs,” said Tateosian. “This is a really big deal, and the industry is at this tipping point, or maybe it already has tipped for this. It’s a big deal because doing audio on DSPs is decades old. Probably tens of thousands of PhDs have spent their career optimizing this capability, and now AI has come into that space and said, ‘That was really good, but we can do it differently, and maybe easier than the traditional way.’”

A further twist is that, in phones, the NPU may in fact be a DSP. “The most famous case of that is Qualcomm’s Hexagon, which is a DSP where they put a bunch of things together, and that now becomes a low-power AI accelerator,” said Imagination’s Liouti. “A DSP is meant for audio signals, but because of what it is doing, it is now an AI accelerator — a CPU with SIMD [single instruction, multiple data] extension that allows it to run operations in parallel and has AI acceleration features on it.”

The case for FPGAs
Essentially, CPUs are more general, ASICs are much more efficient, and GPUs lie somewhere in the middle. However, some believe the best processors here could be FPGAs.

“A lot of conversation in the last 15 years was about general-purpose GPUs, which are commercially successful, and CGRAs (coarse-grained reconfigurable arrays), which are less successful in the marketplace,” said Ilya Ganusov, fellow and director of programmable architecture at Altera, during a recent panel discussion. “FPGAs provide an algorithmic alternative, and we argue that FPGAs are both more flexible and more manageable.”

Meanwhile, others believe embedded FPGAs (eFPGAs) blend the best of both worlds. “An ASIC is always going to get you the lowest power, fastest, smallest area, but when you want to add some compute capability to do those more difficult algorithms, that’s where you put a small amount of FPGA on the ASIC itself,” said Andy Jaros, vice president of IP sales at QuickLogic.

A good use case for this involves sparsity algorithms, which are changing all the time. “A lot of universities are studying how to do more efficient sparsity with the models,” said Jaros. “Typically, that gets done in ASIC gates or hardware, but if that algorithm keeps changing, then you want to be able to update your ASIC for the latest algorithms. You can put that particular RTL on embedded FPGA, and once the new, next-best sparsity algorithm comes into play, you just replace the embedded FPGA algorithm that’s on the ASIC.”
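As a loose software analogy for the replaceable block Jaros describes, the Python sketch below keeps a fixed dot-product datapath and treats the sparsity step as a swappable function. Everything here is an assumption made for illustration: the `Accelerator` class, its `reconfigure` method, and the choice of the two pruning rules (unstructured magnitude pruning and a 2:4 structured pattern) are not taken from QuickLogic or from the source, and real eFPGA updates involve RTL and bitstreams rather than Python callables.

```python
from typing import Callable, List

SparsityFn = Callable[[List[float]], List[float]]


def magnitude_prune(weights: List[float], keep_ratio: float = 0.5) -> List[float]:
    """Unstructured pruning: keep the largest-magnitude weights, zero the rest."""
    keep = max(1, int(len(weights) * keep_ratio))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    top = set(order[:keep])
    return [w if i in top else 0.0 for i, w in enumerate(weights)]


def two_of_four_prune(weights: List[float]) -> List[float]:
    """2:4 structured pruning: keep the two largest magnitudes in each group of four."""
    out: List[float] = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        top = set(order[:2])
        out.extend(w if i in top else 0.0 for i, w in enumerate(group))
    return out


class Accelerator:
    """Hypothetical fixed datapath with one replaceable sparsity stage."""

    def __init__(self, sparsity_fn: SparsityFn) -> None:
        self.sparsity_fn = sparsity_fn

    def reconfigure(self, sparsity_fn: SparsityFn) -> None:
        # Stands in for loading a new design into the embedded FPGA region.
        self.sparsity_fn = sparsity_fn

    def dot(self, weights: List[float], activations: List[float]) -> float:
        sparse = self.sparsity_fn(weights)
        # Multiplications for pruned (zero) weights are skipped.
        return sum(w * a for w, a in zip(sparse, activations) if w != 0.0)


if __name__ == "__main__":
    w = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.05, 0.3]
    acc = Accelerator(magnitude_prune)
    print(acc.sparsity_fn(w))
    acc.reconfigure(two_of_four_prune)  # swap the algorithm, keep the datapath
    print(acc.sparsity_fn(w))
```

The point of the design is that `dot` never changes. Only the function passed to `reconfigure` does, which mirrors the idea of hardening the stable part of the pipeline and leaving the fast-moving part reprogrammable.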

FPGAs also may have customized ASICs inside them, which helps to meet a lot of processing requirements, said Mayank Varshney, vice president of engineering and U.S. operations at Mirafra. “Typically, AI data centers use GPUs, CPUs, DSPs. They tend to be non-deterministic because there’s a lot of writing and reading from memories, so there are a lot of data transfers. What FPGAs traditionally are good for is deterministic results. If you’re running wide parallel processing, if you’re doing a lot of signal processing-type capabilities, they excel there.”

MCU and NPU use cases
For low-power edge devices, an MCU with a DSP, NPU, and neural network accelerator could fit the bill. “The phone has an immense amount of processing power relative to other applications in our daily lives, such as fitness watches and trackers, portable medical or home medical devices, and appliances in our home,” Infineon’s Tateosian said. “These latter examples all typically run on microcontrollers, instead of really high-performance microprocessors like you find in a phone or a laptop.”

Another difference between phones and low-power edge AI devices is that application-specific phone processors run a high-end operating system like Linux, Windows, or iOS. “The microcontroller will run a real-time operating system like FreeRTOS or Zephyr, which is a lightweight software infrastructure that developers write their application on,” Tateosian noted.

Infineon offers machine learning-enabled MCUs. Synaptics, meanwhile, has context-aware AI MCUs.


Fig. 2: Synaptics’ context-aware AI MCUs. Source: Synaptics

High-end mobile devices and data centers often feature GPUs in addition to NPUs, while low-power edge AI devices may have MCUs and NPUs only, due to their greater efficiency.

For example, Expedera recently launched its Origin Evolution NPU IP with hardware acceleration to meet the computational demands of running LLMs, convolutional neural networks, and recurrent neural networks on resource-constrained edge devices, smartphones, automotive, and data centers.


Fig. 3: Expedera’s NPU IP has a packet-based architecture to boost efficiency. Source: Expedera

Neuromorphic computing also is a candidate for AI processing in phones because it reduces power consumption, said Andy Heinig, head of department for efficient electronics in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “There is progress on the electronics side, but the ecosystem, framework, and partners are not yet available for neuromorphic accelerators.”

Conclusion
While NVIDIA GPUs continue to grab headlines in the big data sector, one size does not fit all at the edge, where applications may range from enterprise data centers to mobile devices. Power, performance and area/cost are still the primary concerns, but they can vary in importance depending on the domain, the end customer, the actual workload, and whether the power is delivered through a plug or a battery.

The edge is shaping up to be a complex mix of domain-specific, workload-specific, and generalized compute devices. But unlike in the past, when markets converged, the trend appears to be moving in a very different direction, with more customization and more granular optimization whenever the design budget allows. One size does not fit all, and it’s not clear if it ever will in this space.

1 comment

Ricardo Shiroma says:

What about FPGAs’ ability to be reprogrammed post-deployment, enabling hardware-level updates as AI models evolve? That would allow for efficiency as the model evolves or the use case is better defined.
