Balancing Workloads In AI Processor Designs

Optimization and architecture definition are important, but there’s a downside to too much customization.

A growing number of AI processors are being designed around specific workloads rather than standardized benchmarks. The goal is to optimize performance and power efficiency, often while retaining enough flexibility to adapt to future changes.

While the fundamentals of matrix multiplication and software optimization still apply, those alone are no longer sufficient. Designs need to address specific data types, when and where that data is likely to be processed, what kinds of constraints they will face, and whether workloads are likely to shift by the time a design reaches manufacturing and packaging. This is especially important with AI, where continual changes in algorithms can sharply limit the lifespan of chips that are designed too tightly around today’s workloads.

“When developing a processor, it’s important to understand the true nature of the application and the workload it will run,” said Frederic Piry, vice president of CPU Technology and fellow at Arm. “While benchmarks are valuable for demonstrating performance targets, and there’s much to be learned from them, real-world workloads introduce variables that benchmarks alone do not capture. Processor developers need to know how the application will be executed. The frequency at which a workload executes, the presence of competing processes, and the load on shared resources like memory all affect performance. These conditions can change how memory latency is exposed, how prefetchers are tuned, and how cache topologies should be designed.”

Benchmarks typically emphasize time-to-complete, but specific workloads may have different metrics to consider.

“It’s important to think about workloads on the system level,” Piry said. “In mobile, applications running in the background could affect how processes are run, requiring designers to consider branch prediction and prefetch learning rates. In cloud environments, cores may share code and memory mapping, impacting cache replacement policies. Even the software stack has implications for structure sizing and performance consistency. Processor developers also need to think about how features are used in real workloads. Different applications may use security features differently, depending on how they interact with other applications, how secure the coding is, and the level of overall security required. All of these different choices reveal different performance tradeoffs that must be designed for and tested against.”

Consider workloads in mobile versus data centers, for example. “Mobile processors for phones run a vastly different set of workloads than market-specific and highly tuned processors for AI,” said Steve Woo, distinguished inventor and fellow at Rambus. “Architects need to understand differences in market requirements and how to prioritize different features to meet market needs. For example, mobile processors require low power consumption and fast switching between power modes, while data center AI processors used for the largest AI models need the highest performance, highest memory bandwidths, and fast connectivity to allow large problems to be split among multiple processors. Processor architects must also understand the computational characteristics of the target workload. Are there bandwidth-bound, latency-sensitive, or compute-intensive phases? And how do they interact with each other? For AI workloads, this means profiling AI models like large language models (LLMs) to understand compute, communication, and memory access patterns, available parallelism, and how data moves in the system. Algorithm analysis, code refactoring, and tools such as simulators, performance counters, and workload traces help quantify these behaviors, which are then used to inform architecture and system designs, and are especially helpful in balancing flexibility, power, and throughput in fast-evolving domains like AI, where workloads are shifting rapidly.”
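One common way to make the bandwidth-bound versus compute-intensive question concrete is a roofline-style estimate. The sketch below is illustrative only. The peak compute and bandwidth figures, the phase names, and the operation and byte counts are hypothetical placeholders, not measurements from any processor or model mentioned here.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    ops: float          # arithmetic operations performed in the phase
    bytes_moved: float  # bytes read from or written to memory


def classify(phase, peak_ops_per_s, peak_bytes_per_s):
    """Label a phase with a simple roofline test; return (label, attainable ops/s)."""
    intensity = phase.ops / phase.bytes_moved      # operations per byte
    ridge = peak_ops_per_s / peak_bytes_per_s      # intensity where both limits meet
    attainable = min(peak_ops_per_s, intensity * peak_bytes_per_s)
    label = "compute-bound" if intensity >= ridge else "bandwidth-bound"
    return label, attainable


# Hypothetical machine and workload phases, for illustration only.
PEAK_OPS = 100e12  # 100 trillion operations per second
PEAK_BW = 1e12     # 1 TB/s of memory bandwidth

phases = [
    Phase("large matrix multiply", ops=4e12, bytes_moved=2e10),
    Phase("small-batch weight streaming", ops=2e10, bytes_moved=1e10),
]

for p in phases:
    label, attainable = classify(p, PEAK_OPS, PEAK_BW)
    print(f"{p.name}: {label}, up to {attainable:.1e} ops/s")
```

In practice the operation and byte counts would come from the simulators, performance counters, and workload traces Woo describes, rather than from hand-entered constants.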

Companies with a solid understanding of the workload can then optimize their own designs because they know how a device will be used. This offers significant benefits over a generic solution.

“The whole design arc is bent to service those much more narrowly understood needs, rather than having to work for any possible input, and that gives advantages right there,” said Marc Swinnen, product marketing manager at Ansys, now part of Synopsys.

This is easier said than done, however. “The cognitive load involved in neural network development is pretty high, especially when you look at optimizing these things for a device,” said Geraint North, Arm fellow for AI and developer platforms. “There are two things they need to hold in their heads. One is the architecture of their model, and they care about things like how accurate it is, how big it is, and these kinds of aspects of the model. ML developers understand that stuff pretty well these days. The challenges come when you start to move on-device, because you also have to start to worry about whether the NPU is actually capable of running the thing that I’ve optimized this model down to. That’s why we’re so focused on improving the performance of the CPU. The advantage of the CPU is that it’s general-purpose compute. You can pretty much guarantee it can chew through any operator, any data type that you throw at it. An interesting thing we saw with the work we did with Stability AI was that the parts of the network that they really optimized, SME2 made a bunch faster. The bits they hadn’t spent any time optimizing, it made even faster. And so even for the bits that the ML engineer hadn’t spent any time optimizing (and there are good reasons why they don’t optimize them, as they weren’t a huge part of the workload), SME2 was accelerating those bits by eight to nine times. So the reason we focus on the CPU is that it lets the ML engineers continue to focus on the things that they worry about already today and not have to also worry about the nuances of the hardware accelerators.”

Understanding data types
AI processor developers need to know which data types are to be manipulated and the general realm of the types of algorithms to be executed.

“To develop a good audio processor, the datapath in the processor needs to natively handle the expected audio data sample bit precision,” said Steve Roddy, chief marketing officer at Quadric. “Will the audio data be 8-bit (low fidelity) or 12-bit samples? Or will the audio data be something very high precision, such as full Float32? All of the register files and hardware units (i.e., multiply/accumulates) need to comprehend the data types, especially the accumulation registers. That audio processor needs to have native instructions that efficiently run typical audio filtering and audio processing algorithms. But the developer of that audio DSP doesn’t need to work at Dolby or DTS and have access to the millions of lines of code in a 3D positional audio stack.”

The same is true for SoC developers. “Our NPU is easily integrated,” said Jason Lawley, director of product marketing for AI IP at Cadence. “We’ve got an AXI interface that sits on it, and it’s going to take care of the architectural specifications that have been optimized. Our software that goes along with it takes care of the mapping of those workloads from the framework. A lot of times, the customers are developing in PyTorch or TensorFlow. Then we map that network into the NPU, so when they’re developing their SoC they know they’ve got a general-purpose CPU portion of their design. They know they’ve got this accelerator, which is the NPU. And when they get the workload, which is typically an audio or video frame, all they have to do is compile that particular workload. When they get to the acceleration part, they call a simple API code, and it says, ‘Oh, here’s the frame. Let me send down the instructions that are built in.’ It accelerates, then they get a result. A lot of customers going down the IP path are doing that. If you don’t go down the IP path, you still have to do all of that, but maybe you’re not paying as much attention to the configurability of the engines or trying to make sure you’re fully optimized and lighting up all of the hardware in an optimal fashion. That’s okay, too, depending on whether their workload can be met.”

Similarly with AI, the key factors to consider are the data type and general use cases. “A vision-only NPU might do quite well with being primarily an INT8 machine (8 x 8 MACs),” said Quadric’s Roddy. “Want to run LLMs instead? That likely needs 4-bit weights to compress the giant models, but 16-bit activations (maybe even float16) to preserve accuracy, thus 4 x 16 MACs are the building blocks. And the breadth of how many types of networks you want to run will determine if a fixed-function NPU accelerator with only 30 or 40 graph operators supported is okay, or if you need a processor that can run any of the 2,300+ operators found in PyTorch today. All of that boils down again to data type and operator, but does not require any data science expertise in designing LLMs, or any end-market expertise to build an NPU processor that is successful in, for example, car IVI and ADAS systems. Simply put, you need to know what’s in the AI workloads, not how to create them.”
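The effect of data type on both storage and datapath width can be worked through with a little arithmetic. The following is a minimal sketch under stated assumptions: the 7-billion-parameter model size and the 4,096-term dot product are hypothetical, and activations are treated as signed integers for simplicity, even though Roddy notes they may be float16 in practice.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_weight // 8


def accumulator_bits(weight_bits, act_bits, num_terms):
    """Width of a signed accumulator that cannot overflow when summing
    num_terms products of signed weight_bits x act_bits integers."""
    # Largest product magnitude: most-negative weight times most-negative activation.
    max_product = (1 << (weight_bits - 1)) * (1 << (act_bits - 1))
    return (max_product * num_terms).bit_length() + 1  # +1 for the sign bit


PARAMS = 7_000_000_000  # hypothetical model size
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_bytes(PARAMS, bits) / 1e9:.1f} GB")

# Hypothetical dot-product length of 4,096 terms.
print("8 x 8 MAC accumulator:", accumulator_bits(8, 8, 4096), "bits")
print("4 x 16 MAC accumulator:", accumulator_bits(4, 16, 4096), "bits")
```

Under these assumptions, 4-bit weights cut storage to a quarter of the 16-bit figure, while the 4 x 16 datapath still needs a wider accumulator than the 8 x 8 one. That is the kind of register-sizing decision that follows directly from the data types.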

There are also differences in what the designer needs to know between different types of processors.

“Designers of battery-powered mobile processors for phones focus on power consumption and fast power state transitions, where the processor works alone in a small form-factor environment that supports small memory capacities,” Rambus’ Woo said. “In contrast, data center AI processors used for training and inference on the largest workloads focus on performance and throughput, have much larger power budgets and advanced cooling solutions, as well as access to much higher memory capacities and bandwidths. GPU designers focusing on AI training will prioritize parallelism and memory bandwidth for these data-heavy tasks. Meanwhile, processors optimized for inference often have lower resource requirements, as the models are quantized to reduce memory capacity and bandwidth requirements. Targeted processing pipelines improve efficiency on devices used outside the data center. Each processor type demands a tailored understanding of workload structure, performance bottlenecks, and deployment context, whether it’s cloud-scale, mobile, or embedded.”

Representing the workload
What is the best way to represent a workload to an NPU designer? Roddy said the easiest solution is to provide a set of representative benchmark models.

“Let’s say you wanted to build a really efficient LLM inference ‘machine’ to run Llama and DeepSeek’s QWEN model,” Roddy explained. “Start with the published, public models. The structure of the models (that is, the graph operators and the interconnection between them) is all openly published. What’s proprietary and often kept under wraps are the trained model weights, the billions of parameters. Analyze the reference models and you can quickly see, a) the data type (precision); b) the computation required (MACs, Gaussian equations, quadratic equations, simple averaging, or control flow ops); and c) the frequency or intensity of each of those operations, namely what gets executed billions of times versus what gets executed only twice. The only tricky part is to accurately describe what future changes you want the machine to be able to react to. Will data types change? Will the commonly used operators change? Note that PyTorch has more than 2,300 operators, most of which are seldom used today, but might suddenly become critical in future AI innovations. Known workload today, plus the envelope of future flexibility you need, will fully describe the problem to be solved.”
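The analysis Roddy describes amounts to building a histogram over a model graph. Here is a minimal sketch, assuming the graph has already been exported as a list of (operator, data type, execution count) records. That record format, the operator names, and the counts are all hypothetical and do not come from any real model or framework export.

```python
from collections import Counter

# Hypothetical graph records: (operator, data type, times executed per inference).
graph = [
    ("matmul", "int4 x fp16", 1_000_000),
    ("softmax", "fp16", 32),
    ("layer_norm", "fp16", 64),
    ("matmul", "int4 x fp16", 500_000),
    ("top_k", "fp32", 2),
]


def profile(nodes):
    """Tally how often each operator and each data type is executed."""
    by_op = Counter()
    by_dtype = Counter()
    for op, dtype, executions in nodes:
        by_op[op] += executions
        by_dtype[dtype] += executions
    return by_op, by_dtype


by_op, by_dtype = profile(graph)
for op, count in by_op.most_common():
    print(f"{op:12s} {count:>10,d}")
```

Sorting by count separates the operators that deserve dedicated hardware from those that run only a handful of times and can fall back to a more general, slower path.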

At the same time, the end application may provide clear restrictions, or at least boundary conditions, to be considered during processor design. “This is true, for instance, in the case of safety-critical applications, where safety mechanisms like a watchdog or lockstep need to be implemented to prevent the processor from running into any kind of deadlock,” noted Roland Jancke, head of design methodology at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “For such situations, there are respective standards and testing routines out there so that designs can be qualified according to such standards.”

Other requirements coming from the application, such as energy consumption, are not so easy to quantify. “A full-blown water-cooled AI accelerator might be desired for the tasks at hand, but will be too heavy for a lightweight drone application,” Jancke said. “In general, autonomous systems will give boundary conditions on power consumption and weight for the processor design. On the other hand, our experience in talks with industry is that often the question is asked the other way around. ‘To what degree does the developer of a specific algorithm need to know the exact structure and properties of the executing processor?’ Of course, it might increase the efficiency of an algorithm implementation to know the details about the available processing elements. Still, such architectural details might change over the course of the development cycle. With this in mind, we see an increasing trend toward an abstraction process. Setting up a well-defined interface, like an API, offers the possibility of a relatively independent development of hardware and software. The software is developed based on that API, while the hardware is focused on implementing the instructions of the API as efficiently as possible. Even if the exact application is not known and not taken into account, this may be advantageous for the development process.”
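The API-based split Jancke describes can be illustrated with a small interface. This is a sketch only. The `Accelerator` interface, the `ReferenceBackend` class, and the `dense_layer` function are hypothetical names invented for illustration and do not correspond to any vendor's API.

```python
from abc import ABC, abstractmethod


class Accelerator(ABC):
    """Hypothetical hardware-facing API that software is written against."""

    @abstractmethod
    def matmul(self, a, b):
        """Multiply two matrices given as lists of rows."""


class ReferenceBackend(Accelerator):
    """Plain-Python stand-in; a hardware team would supply a faster backend."""

    def matmul(self, a, b):
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]


def dense_layer(device, inputs, weights):
    """Application code depends only on the interface, not on the backend."""
    return device.matmul(inputs, weights)


print(dense_layer(ReferenceBackend(), [[1, 2]], [[3], [4]]))
```

Because `dense_layer` sees only the interface, the backend can change over the development cycle without the software having to follow every architectural revision.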

All of these technical considerations are why AI IP makes sense for some AI processor developers. “The SoC developers have to understand what that customer wants,” said Cadence’s Lawley. “But the other key piece that a lot of people overlook is the software that compiles the AI model that is going to run on their SoC. They’re going to have to develop that and maintain it through the lifetime of the solution, and with how quickly AI is changing, it’s expensive to ramp up a software team and find the right people. You have to find compiler experts who are not just AI developers. They have to take that model, compile it, and then optimize it to run on that particular hardware that they’ve created. A lot of times it’s difficult for them, unless they have very specific, bespoke workloads on which the same model is going to run all the time. For instance, in the embedded space, teams that are designing SoCs that are going to go into a very specific robot or something like that can say, ‘We know we’re going to have some kind of AI model.’ If you look back at the initial NPU designs, a lot of them were more like glorified matrix multiplication engines. There’s a lot of hype around what the NPU is going to do. They’re going to solve world hunger and accelerate AI. So we’ve hit a little bit of a [disillusionment] part of the NPU, but that’s temporary. We’re getting into the second- and third-generation NPUs. We’re on our second generation, and we’ve learned a lot about what the architecture of the NPU needs to look like to make it so that it’s much more useful for general-purpose AI acceleration.”

This interplay between processor specialization and the realities of supporting ever-evolving workloads is a big challenge for designers. Accurately anticipating activity is both necessary and elusive.

“This is easier said than done,” Ansys’ Swinnen said. “When running a program, how do you capture Apple’s Safari browser, for instance? How is this activity different than Firefox or Chrome? If Apple wants to make their chip more efficient for Safari, how do you capture that? That’s millions of lines of software code. You can’t quite capture something you use in silicon design easily.”

This is where emulation comes into play. “For verification engineers and RTL designers who have been tasked with guessing or predicting what the workload would look like, they have done a good job, but when emulation platforms were created, it was revolutionary because it took the guesswork out of the system,” observed Suhail Saif at Ansys, now part of Synopsys. “You could put your system in an FPGA format or something similar, onto the emulation box, and run a real workload before fabricating it. And when it comes to real workloads, whether it is something running on a Safari browser on an iPhone or running a YouTube video on a phone, making a call, or playing a game on an iPad, or the like, those are very real-world workload scenarios that the chip, system, or device designer would like its end user to do. There is no denying from the activity file that you get out of that emulation, because it represents a 100% real workload scenario. And if you find issues simulating your design through that workload vector, then you know for sure, without a doubt, that’s an issue you have to fix before it goes to fabrication because it is exposing a real issue from a real workload. Within power estimation, voltage drop analysis, and designing power grids for signal nets, we rely on these real-life workloads coming from emulation all the time, and they carry the maximum weight.”
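To show why an activity file matters for power estimation, the sketch below applies the textbook dynamic power relation P = alpha * C * V^2 * f to per-net toggle counts. It is a simplified illustration, not how any particular tool works. The net names, toggle counts, capacitances, supply voltage, and clock frequency are all hypothetical, and half of all toggles are assumed to be rising edges.

```python
def switching_power(toggles, cycles, cap_farads, vdd_volts, freq_hz):
    """Dynamic power of one net: P = alpha * C * V^2 * f, where alpha is
    the average number of rising (0 to 1) transitions per clock cycle."""
    alpha = (toggles / 2) / cycles  # assume half of all toggles are rising edges
    return alpha * cap_farads * vdd_volts ** 2 * freq_hz


# Hypothetical activity records: net name -> (toggle count, load capacitance in farads).
activity = {
    "clk_gate_out": (2000, 5e-15),
    "alu_result[0]": (400, 2e-15),
    "idle_flag": (4, 1e-15),
}
CYCLES = 1000  # clock cycles covered by the hypothetical workload window
VDD = 0.8      # supply voltage in volts
FREQ = 1e9     # clock frequency in hertz

per_net = {net: switching_power(t, CYCLES, c, VDD, FREQ)
           for net, (t, c) in activity.items()}
total = sum(per_net.values())

for net, watts in sorted(per_net.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{net:15s} {watts * 1e6:8.4f} uW")
print(f"total           {total * 1e6:8.4f} uW")
```

The estimate is only as good as the toggle counts fed into it, which is why activity captured from a real workload carries more weight than a guessed switching rate.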

Conclusion
AI processors require a balance of performance, power, and enough area to help future-proof a design. “Nobody’s going to get fired for adding a little bit of AI into their SoC,” Lawley said. “Whether you get something that’s from an IP provider, or you add some matrix multiplication capabilities into your SoC, now is the time to do it, because everybody’s going to look back and say, ‘I wish I would have had more AI capabilities in my hardware,’ because that’s the hard part. For the SoC, it’s a two- to five-year runway to get an SoC built. Where were we two years ago, and where are we going to be in two to three years from now when these devices launch? Spending area on AI capabilities and flexible AI capabilities is going to be important, even for the robot that it goes in, and even when you’re doing these very specific features. The models are changing fast. For a model that’s out there now, who knows if there’s going to be a new model that’s been fully optimized and much better. So make sure you have AI capabilities in your SoC, and then try to make sure that it’s going to be as flexible as possible, and that it’s connected into the systems that can be used in multiple ways.”
