Partitioning Processors For AI Workloads

General-purpose processing and a lack of flexibility are far from ideal for AI/ML workloads.

Partitioning in complex chips is beginning to resemble a high-stakes guessing game, where choices need to extrapolate from what is known today to what is expected by the time a chip finally ships.

Partitioning of workloads used to be a straightforward task, although not necessarily a simple one. It depended on how a device was expected to be used, the various compute, storage, and data paths on a chip, and how different workloads were prioritized. But with the ongoing changes in AI, including a continuous stream of algorithm updates, new use cases, and entirely new applications and priorities, what is designed today often is not the optimal configuration by the time a chip ships. As a result, design teams need to consider a growing list of options and tradeoffs, where the best choice may be a balance between flexibility and less-than-optimal performance, power, and throughput.

“AI is a very fast-moving field, with the workloads radically changing from year to year,” said Ian Bratt, fellow and senior director of technology at Arm. “The top factor to consider when architecting a chip for AI workloads is building in the right amount of flexibility and future-proofing. With the advent of large language models (LLMs) we’ve seen new AI workloads, which are primarily bandwidth-limited — a stark contrast to convolutional neural network (CNN) workloads, which tend to be compute-limited. Bandwidth-limited neural networks like LLMs benefit from the flexibility of software-based de-quantization schemes, enabling designers to save footprint and bandwidth by inventing compression schemes that can be easily decompressed on programmable platforms like CPUs. This is just one of many workload shifts for AI, and the shifts will continue.”
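As a concrete illustration of the software-based de-quantization Bratt describes, the sketch below compresses weights into int8 blocks with one scale per block and decompresses them with a single multiply, the kind of cheap operation a CPU can run inline. The scheme and all numbers are illustrative, not a specific Arm implementation:

```python
import numpy as np

def quantize_block(weights, block_size=32):
    """Compress float32 weights to int8 with one float scale per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # guard against all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_block(q, scales):
    """One multiply per weight: cheap enough to run inline on a CPU."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_block(w)           # roughly 4x less footprint and bandwidth
w_hat = dequantize_block(q, s)
max_err = np.abs(w - w_hat).max()  # rounding error is small and bounded
```

Footprint and bandwidth drop roughly 4x (int8 payload plus one scale per 32 weights), at the cost of a small, bounded rounding error.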

Simply put, it is unwise to over-optimize and tailor the platform for the AI workload of the moment.

“When you’re thinking about something as high-performance as AI, it’s all workload-driven,” said Steven Woo, fellow and distinguished inventor at Rambus. “So you have to pick the thing you want to be good at, and that’s the thing you’re going to design for. What’s interesting about AI is if you’re talking about AI at scale, there’s a very big pipeline of data movement from storage servers all the way through to the engines and back. You go through switches and NICs and things like that. These days, AI is increasingly encompassing lots of components along the way. There are regular servers in the way that are trying to help you retrieve the data from your storage, and they’re trying to format it and transform it for the AI engines.”

Training versus inference
The first thing to think about is which part of the system a chip is going to be servicing, because the workload in one part may be very different from another. What is done in a NIC is very different from what is done in an AI training engine.

“This means the memory system looks very different, the processing looks different,” Woo said. “But if you’re just thinking about the AI training engine, then increasingly you have to think about whether this is going to be used for training, for inference, or both. Training and inference have some similarities, but they have some things that are very opposite of each other. So increasingly what we’re seeing is people looking at a training-only solution, or an inference-only solution, in addition to a solution that does both pretty well. Then it comes down to the models, and the size of model you’re looking for.”

Compute intensity equals model complexity
Both inferencing and training workloads are compute-intensive.

“As the complexity of models has increased, it is less likely that leaving the work to a general-purpose processor is practical, even if the processor has been augmented with AI instructions,” said Russell Klein, program director for Siemens EDA’s Catapult Software Division. “To be competitive, some kind of specialized acceleration hardware is required.”

The workhorses of today’s AI algorithms, convolution and matrix multiplication, are obvious candidates for acceleration.

“However, partitioning is not that simple, as moving intermediate results between general-purpose processors and an accelerator imposes a significant overhead,” Klein said. “Often it makes sense to ‘fuse’ layers of the network into a single accelerated operation. For example, a convolution may be followed by an activation function, and then a max pooling operation. Performing all three of these operations in a single, or ‘fused,’ hardware operation can be very efficient. The activation function and max pooling usually require small amounts of hardware, and can be performed in parallel with the convolution. However, an activation function can be anything. A relu (rectified linear unit) function is quite simple and easily implemented in hardware, but a hyperbolic tangent function is a different story. A good partitioning requires a clear understanding of the types of models that will be supported by the design.”
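The fused convolution/activation/pooling pattern Klein describes can be sketched in a few lines. The 1-D NumPy version below is purely illustrative; in hardware the three stages would share one datapath rather than writing two intermediate tensors back to memory:

```python
import numpy as np

def conv_relu_pool_fused(x, k):
    """1-D convolution, ReLU, and stride-2 max pooling in one pass.

    Fusing the three layers avoids shuttling intermediate results
    between a general-purpose processor and an accelerator."""
    n, kw = len(x), len(k)
    conv = np.array([np.dot(x[i:i + kw], k) for i in range(n - kw + 1)])
    act = np.maximum(conv, 0.0)     # ReLU: trivial to bolt onto the conv datapath
    act = act[: len(act) // 2 * 2]  # drop an odd tail so pairs pool cleanly
    return act.reshape(-1, 2).max(axis=1)
```

A hyperbolic tangent activation, by contrast, would need substantially more hardware than the `np.maximum` above, which is Klein's point about knowing the supported models up front.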

At the same time, AI algorithms almost always are embarrassingly parallel. “This means it is relatively easy to add computational units to improve performance,” he said. “But as more compute units are added, data movement becomes a challenge. As the number of computational elements increases, designers need to find a way to move data to and from them. Data movement can quickly become a bottleneck in the design. Balancing the communication and computation capabilities across a broad set of AI models is challenging.”

Architects and designers still need to worry about meeting the PPA targets, but they also need to understand the tradeoffs and options. “As an example, can performance be lowered by 5% to achieve 20% power savings?” said William Ruby, product management director for the EDA Group at Synopsys. “Does the die size need to be constrained? For what reason? Cost, packaging, form factor limitations?”
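Ruby's hypothetical trade can be checked with simple arithmetic. Using illustrative baseline numbers, giving up 5% of throughput for a 20% power reduction still cuts energy per inference by roughly 16%:

```python
# Ruby's hypothetical trade: give up 5% of performance for 20% power savings.
# Energy per inference = power / throughput, so the trade is a net win
# whenever the power saving outweighs the throughput loss.
baseline_power = 10.0  # watts (illustrative)
baseline_tput = 100.0  # inferences/sec (illustrative)

traded_power = baseline_power * (1 - 0.20)
traded_tput = baseline_tput * (1 - 0.05)

energy_base = baseline_power / baseline_tput    # joules per inference
energy_traded = traded_power / traded_tput
saving = 1 - energy_traded / energy_base        # ~15.8% less energy per inference
```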

Gordon Cooper, product manager for AI/ML processors at Synopsys, explained that AI networks must be able to handle high computation workloads. “The ‘cost’ is not just in the compute,” he said. “It is also in the memory accesses. Designing a high-performance, low-power AI processor requires specialized hardware optimized for these huge workloads.”

For high-performance AI processing that requires some level of programmability, NPUs offer the best power efficiency.

“If you have a trained network that needs to reach a certain performance level (frames or inferences per second), you must partition the network to spread the processing across multiple processors,” Cooper noted. “There are multiple ways to partition the network — across layers of the network, channels within a layer, or spatially dividing up your image. Your design needs to be flexible, as there is no one right way to partition. You might need a mix of partitioning within each network. Also, complex software is needed to manage network partitioning.”
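Of the partitioning options Cooper lists, spatial division is the easiest to sketch. The toy function below splits an image into per-processor tiles with a one-pixel halo of overlap so each tile can be convolved independently; the grid and halo width are illustrative:

```python
import numpy as np

def spatial_partition(image, n_rows, n_cols, halo=1):
    """Split an image into n_rows x n_cols tiles, one per processor.

    Each tile carries a halo of overlapping border pixels so a small
    convolution can run on it without fetching data from neighbors."""
    h, w = image.shape
    tiles = []
    for r in range(n_rows):
        for c in range(n_cols):
            r0 = max(r * h // n_rows - halo, 0)
            r1 = min((r + 1) * h // n_rows + halo, h)
            c0 = max(c * w // n_cols - halo, 0)
            c1 = min((c + 1) * w // n_cols + halo, w)
            tiles.append(image[r0:r1, c0:c1])
    return tiles

# Four processors, each getting a 4x4 quadrant plus a 1-pixel halo
tiles = spatial_partition(np.arange(64, dtype=float).reshape(8, 8), 2, 2)
```

Partitioning across layers or across channels follows the same pattern but cuts along different axes of the network, which is why the software managing the partitioning has to stay flexible.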

Co-design complexity
Hardware/software co-design can get tricky, because in many cases the workloads either are unknown in advance or they change over time.

“You’re planning a design that you will tape out about a year from now, that will be in silicon two years from now, and it has to live for five years,” said Suhas Mitra, product marketing director for Tensilica AI Products at Cadence. “How do you partition these things when you really do not know the future? You have to cut some corners. You have to bend the spoon in the direction that you want it to bend. For edge applications, the most important thing for an SoC architect is to understand how software will partition the workload. Similarly, the software people also have to understand the different capabilities in hardware that will be available for them to map these workloads. The intensity of hardware/software co-design has increased manyfold.”

A simple case of AI workload mapping would be taking the workload and understanding how to optimize it, or its constituent operations, on a given platform.

“If somebody says, ‘Here’s the workload to do AI processing for this device,’ you have to bring that AI workload down to a point where your software compiler tool chain can ingest the model, take the model, break it down into smaller operations, in what is called graph processing,” Mitra explained. “You then implement those operations and map them to the available hardware that you have. Not all hardware is the same, and there are ‘n’ number of ways you can map a certain thing because there is no one way of mapping. How do I make sure this is efficiently mapped? What does it mean in the context of the operation with other operations? All those issues are going to play out in how to map models very efficiently. I may also decide in my hardware to have some capability for running some functionality that is a slightly more efficient map. For example, if I have hardware that can do something like an element-wise operation better, maybe there’s a hardware ISA that can multiply or do power functions better, and I can use that. If I didn’t have that, I would go the usual way and use a log function or exponential function to do this. But if we have a kernel that can actually map this exponential on a hardware, that’s a choice I have to make. So, there’s an ‘n’ number of parameters that you’re trying to optimize on one end.”
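The graph-processing step Mitra describes, breaking a model into operations and mapping each one to whatever the hardware supports, can be caricatured as a lookup against a kernel table. The op names and costs below are hypothetical:

```python
# Hypothetical table of ops this NPU has native kernels for, with their
# relative per-element cost versus a generic CPU fallback path.
NATIVE_KERNELS = {"matmul": 1.0, "conv2d": 1.0, "exp": 2.0}
FALLBACK_COST = 10.0  # anything else goes "the usual way" on the CPU

def map_graph(ops):
    """Assign each op in a linearized graph to hardware, tallying cost."""
    plan, cost = [], 0.0
    for op in ops:
        if op in NATIVE_KERNELS:
            plan.append((op, "npu"))
            cost += NATIVE_KERNELS[op]
        else:
            plan.append((op, "cpu-fallback"))
            cost += FALLBACK_COST
    return plan, cost
```

This mirrors Mitra's exponential example: if the hardware ISA has a native `exp` kernel, the compiler uses it; if not, the operation falls back to a slower generic path.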

In a slightly more complex case on a mobile platform, maybe there is a CPU, a GPU, and a vision or audio DSP. “If I take the exact same example of a vision DSP workload, I have the ability to map that workload, depending on which operation I do, on multiple endpoints,” he said. “I may decide to do operations 1, 2, 3 on a CPU, 5, 6 and 7 on a GPU, 9 to 50 on a DSP, and then go back to the GPU and back to DSP. I have different parameters, and different tradeoffs that I’m making while doing this compute. Each of these compute tradeoffs has a price to pay. It is the price of area, of data movement, of power, of bandwidth, and with that comes complexity. How do you take a model and break it up into smaller pieces? If I have 10,000 images, there’s no way I can hold those images in one place with hardware that splits the thing up, because they also have to collaborate on different endpoints to bring them together. So this can be a very complex problem depending on which spectrum you’re looking at.”
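The multi-endpoint tradeoff Mitra outlines can be framed as a small optimization problem: each op has a different cost on each device, and every device switch pays a data-movement penalty. The brute-force sketch below uses invented cost numbers purely to show the structure of the decision:

```python
from itertools import product

# Per-op compute cost on each endpoint (invented numbers), plus a fixed
# penalty whenever consecutive ops land on different devices: the price
# of moving intermediate data between separate memory subsystems.
COST = {
    "cpu": {"reshape": 1, "conv": 20, "softmax": 2},
    "gpu": {"reshape": 3, "conv": 2, "softmax": 3},
}
TRANSFER = 4

def plan_cost(ops, placement):
    """Total cost of running each op on its assigned device."""
    total = sum(COST[dev][op] for op, dev in zip(ops, placement))
    total += TRANSFER * sum(a != b for a, b in zip(placement, placement[1:]))
    return total

def best_placement(ops):
    """Exhaustively search all device assignments (fine for tiny graphs)."""
    return min(product(COST, repeat=len(ops)), key=lambda p: plan_cost(ops, p))
```

Real compilers replace the exhaustive search with heuristics or dynamic programming, but the tension is the same: a device that is cheaper for one op may not be worth the transfer cost of reaching it.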

Not just a neural network
Yet another consideration in partitioning processors for AI workloads is that the workload on the edge is not just a neural network. It is far more complex.

“First of all, there are multiple neural networks,” said Sharad Chole, chief scientist at Expedera. “Most often, there is also real-time sensor data that has to come in and needs to be processed so that it can be fed to the neural network. The output of the neural network needs to be processed afterwards, as well. Maybe it needs to be combined. Maybe the workload needs to do some sort of sensor fusion, and then again, execute a different neural network. It’s a pipeline of processing of neural network workloads, and neural networks. It becomes a huge pipeline, and sometimes this pipeline is known in advance, where you can actually try to schedule and optimize based on the workload. Sometimes this is not known in advance. The problem is that we can’t just say, ‘This is the neural network, get me the best NPU or the best system architecture.’ It evolves by understanding what the pre-processing is, what the post-processing is, as well as what type of precision you want to support for meeting the accuracy demand, because everything distills down to accuracy.”
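The pre-processing, inference, fusion, and second-network pipeline Chole describes can be sketched as a chain of stages. Every function below is a stand-in; real stages would be DSP kernels, neural networks, and fusion logic running on different engines:

```python
# Each stage is a stand-in for a real processing step in an edge pipeline.
def preprocess(raw):
    """Normalize raw sensor samples before inference."""
    return [x / 255.0 for x in raw]

def detector(frame):
    """First 'neural network': keep strong responses."""
    return [x for x in frame if x > 0.5]

def fuse(detections, imu_reading):
    """Sensor fusion: combine detections with a second sensor stream."""
    return [(d, imu_reading) for d in detections]

def classifier(fused):
    """Second 'neural network' consuming the fused data."""
    return len(fused)

def pipeline(raw, imu_reading):
    """The whole chain: pre-process, infer, fuse, infer again."""
    return classifier(fuse(detector(preprocess(raw)), imu_reading))
```

When this chain is known in advance it can be scheduled and optimized as a whole; when it is not, the NPU and system architecture have to leave headroom for stages that do not exist yet.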

What makes all of this that much more challenging is the fact that there isn’t a single approach for any of it. 3D-ICs will only make that worse.

“To really get your AI architecture to run efficiently, you need to have the silicon match your algorithm for fast, real-time delivery,” said Marc Swinnen, director of product marketing at Ansys. “Depending on the architecture — the number of layers that you use, whether you’re using a straight neural network or a mix of neural network and some processing — if the silicon matches that, you’ll get much faster performance. You can do any of them in software. You can just write a program that simulates, and that’s totally universal, but totally slow and high power. So you want silicon, and you can take generic silicon, but you’ll still get performance and power improvement by making silicon that matches exactly what you need. That’s why there are all these different silicon architectures.”

For all of these architectures, partitioning is a universal challenge. “It’s one of those very manual steps, like floor-planning, that are still done early on,” Swinnen said. “They have huge consequences downstream, and you’re not always sure what the consequences are. But what’s changed is with 3D-IC, there are literally hard partitions. You’re going to put multiple chips in one big design, or you’re going to partition pieces out onto multiple chiplets. How are you going to divide that? The consequences are pretty severe and deep, because once you partition which logic goes on which chip, then everybody goes off, designs their chips, and at the end the system is going to work — you hope.”

Partitioning also needs to include thermal considerations. “Partitioning used to be all about timing and performance, but now thermal comes into the picture because there are blooms of temperature that depend on the operating mode, whether it’s receiving, transmitting, or running video or audio,” Swinnen noted. “Different usage modes have different spots that get hot. You want to avoid having two chips overlaid or right next to each other that have the same usage and the same hotspots, because then you’re going to amplify them, and that’s not good. But how do you know that upfront?”

Starting well
So, given all of these challenges, what is the starting point for designing chips for AI workloads?

Siemens’ Klein said that like any design project, engineers need to fully understand the market and the application the system will address. “Will the design need to support both training and inferencing, or just inferencing? Does it need to support every possible neural network model, or can it be focused on a specific sub-set? Are the requirements likely to change once the system is built, or are the requirements well understood and the function fixed? Will it be powered by a watch battery, or a 400 horsepower diesel generator? With an understanding of all of these issues and more, the designer can make well-informed decisions and intelligent tradeoffs. It is really a tradeoff between a design that is highly generalized and one that is specialized. Generalized systems will need to perform both training and inferencing. They will need to support any and all neural network models that anyone can dream up, or will dream up someday in the future.”

Being able to support such a breadth of algorithms requires a lot of resources, and adds both complexity and inefficiency.

“When building such a generalized system, it is impractical to take advantage of operator fusion or quantization, although some recent accelerators do support both float32 and int8 data types,” Klein continued. “By narrowing the scope of what can be supported, designers can create a more tailored implementation that can be both faster and lower power. For example, rather than supporting all of TensorFlow, picking the most commonly used subset can simplify the design. Thus, designers could build a faster, more efficient chip, but for a smaller potential market.”

In addition, a highly specialized design, or a bespoke accelerator, could be built that only supports a very limited sub-set of possible neural network models, but can be much faster and more efficient.

“The key is to tailor the accelerator to the specific inference,” Klein said. “This is really only practical in vertically integrated markets, where the system is being designed for a single specific application. Such an accelerator can take full advantage of operator fusion and quantization. The number and geometry of the processing elements can be sized specifically to the problem at hand. And the data paths can be sized to meet the throughput and response time for the specific application. It can easily outperform the best off-the-shelf accelerators or configurable IP.”

Synopsys’ Ruby also suggests modeling the hardware architecture to analyze PPA. “Optimize software. Get to functional RTL as soon as possible, run real workloads on the emulator, and analyze power consumption,” he said.

The easy part of designing an AI processor is putting down a bunch of MAC units. “The harder part is dealing with memory, and moving data around between the memories in a way that will work for many current and future neural networks,” Synopsys’ Cooper said. “The hardest part is creating and maintaining software – a neural network compiler or SDK – to manage the hardware and memory accesses. A lot of solutions out there are actually not very efficient at maximizing utilization of all the MACs.”
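Cooper's point about utilization comes down to a simple ratio: useful MAC operations divided by what the array could have performed in the cycles it actually took. With illustrative numbers, a poor schedule on a 4,096-MAC array can leave nearly 40% of the array idle:

```python
def mac_utilization(macs_needed, array_macs, cycles_taken):
    """Useful MACs divided by the MACs the array could have performed."""
    return macs_needed / (array_macs * cycles_taken)

# Illustrative: a 1,000,000-MAC layer on a 4,096-MAC array. Perfect tiling
# needs ceil(1e6 / 4096) = 245 cycles; a poor schedule that stalls on
# memory accesses and takes 400 cycles leaves nearly 40% of the array idle.
ideal = mac_utilization(1_000_000, 4_096, 245)
poor = mac_utilization(1_000_000, 4_096, 400)
```

The gap between the two numbers is almost never about the MACs themselves; it is the memory system and the compiler's scheduling that determine how close a design gets to the ideal.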

Thinking outside the box
Creativity is a necessity when it comes to addressing the partitioning task for processors that are taking on AI workloads, both for a processor subsystem going into a much larger chip and a chip where the main function is AI acceleration. At the base level, virtually all SoC design starts today — for a wide array of embedded, consumer, industrial and automotive applications — must provision sufficient compute horsepower for running modern AI/ML inference workloads.

“Two or three years ago, the conventional wisdom for tackling AI inference was to build an accelerator tightly coupled to a CPU or DSP, wherein the repetitive convolution layers of ML graphs that the CPU or DSP cannot process at the required speed were offloaded to the accelerator, while small portions of the ML workload, such as unusual layers or SoftMax/NMS final layers, run on the CPU,” said Steve Roddy, vice president of marketing at Quadric.

Inherent in those architectures was the notion that partitioning a workload between CPU, DSP, and NPU was both a wise method of achieving functionality and an infrequent task. “But that conventional wisdom is now failing because the state-of-the-art (SOTA) models today, such as large language model transformers and vision transformers, are both radically different from the SOTA models of just two years ago,” Roddy said. “And the rate of change of models is only increasing, not decreasing. A vintage 2019 SOTA model, such as Resnet18 or Resnet50, would require only a single partition — the 50 convolution-pooling-activation chains in the backbone on the NPU with the final SoftMax on the CPU.”

A transformer structure (see figure 1 below) is far more complicated, and attempts to partition on a heterogeneous multi-core architecture will result in dozens of round trips of moving data between the separate memory subsystems of each of the multiple processing engines. “Each of those shufflings of data burns power and kills throughput while accomplishing zero real work,” he said. “Attempting to partition and map a new, complex LLM or ViT model onto a heterogeneous architecture is orders of magnitude more difficult and time consuming than the 2019 SOTA models.”

Fig. 1: Transformer structures. Source: Quadric

In response, Quadric created a dedicated AI/ML inference processor architecture that runs all neural network layer types, plus complex C++ control code that is increasingly embedded within new ML graphs. Every layer of a transformer – plus additional pre-processing and post-processing workloads in the signal chain – runs on the single Quadric GPNPU. “There is no partitioning required,” Roddy said. “Today’s SOTA models run. And tomorrow’s models will run also.”

Related Reading
Processor Tradeoffs For AI Workloads
Gaps are widening between technology advances and demands, and closing them is becoming more difficult.
Patterns And Issues In AI Chip Design
Devices are getting smarter, but they’re also consuming more energy and harder to architect; change is a constant challenge.
Making Tradeoffs With AI/ML/DL
Optimizing tools and chips is opening up new possibilities and adding much more complexity.
