AI workloads are changing processor design in some unexpected ways.
AI is changing processor design in fundamental ways, combining customized processing elements for specific AI workloads with more traditional processors for other tasks.
But the tradeoffs are increasingly complex and challenging to manage. For example, workloads can change faster than the time it takes to churn out customized designs. In addition, AI-specific processors may exceed their power and thermal budgets, which may require adjustments to the workloads. And integrating all of these pieces may create issues that need to be solved at the system level, not just in the chip.
“AI workloads have turned processor architecture on its head,” said Steven Woo, fellow and distinguished inventor at Rambus. “It was clear that existing architectures didn’t work really well. Once people started realizing back in the 2014 timeframe that you could use GPUs and get tremendous gains in training performance, it gave AI a massive boost. That’s when people started saying, ‘A GPU is kind of a specialized architecture. Can we do more?’ It was clear back then that multiply accumulates, which are very common in AI, were the bottleneck. Now you’ve got all this great hardware. We’ve got the multiply accumulate stuff nailed. So what else do we have to put in the hardware? That’s really what architecture is all about. It’s all about finding the tall peg or the long tent pole in the tent, and knocking it down.”
Others agree. “AI just lends itself to GPU architecture, and that’s why NVIDIA has a trillion-dollar market cap,” said Ansys director Rich Goldman. “Interestingly, Intel has been doing GPUs for a long time, but inside of their CPUs to drive the video processors. Now they’re doing standalone GPUs. Also, AMD has a very interesting architecture, where the GPU and CPU share memory. However, the CPU is still important. NVIDIA’s Grace Hopper is the CPU-GPU combination, because not everything lends itself to a GPU architecture. Even in applications that do, there are parts that run on just small CPUs. For decades, we’ve been running everything on a CPU x86 architecture, maybe RISC architecture, but it’s a CPU. Different applications run better on different architectures, and it just happened that NVIDIA focused first on video gaming, and that translated into animation and movies. That same architecture lends itself very well to artificial intelligence, which is driving everything today.”
The challenge now is how to develop more efficient platforms that can be optimized for specific use cases. “When you implement this thing in real, scalable hardware, not just one-off use cases, then the challenge becomes how do you run this thing?” said Suhas Mitra, product marketing director for Tensilica AI Products at Cadence. “Traditionally in processors, we had a CPU. And if you had a mobile platform, you had a GPU, DSP, etc. All of this got rattled because people saw these workloads are sometimes embarrassingly parallel. And with the advent of parallel computing, which is the reason GPUs became very popular — they had very good hardware engines that could do parallel processing — the suppliers easily cashed in immediately.”
This works best when workloads are well understood and well defined, said Sharad Chole, chief scientist at Expedera. “In those kinds of architectures, let’s say you are trying to integrate an ISP and an NPU in a tightly coupled fashion in edge architectures. The SoC leads are looking into how they can reduce the area and power for the design.”
The challenge here is to understand the latency implications of the memory portion of the architecture, Chole said. “If an NPU is slow, what would the memory look like? When the NPU is fast, what would the memory look like? Finally, the question of balancing the MACs versus balancing the memory comes from trying to reduce input and output buffering as much as possible.”
External memory bandwidth is a key part of this, as well, particularly for edge devices. “No one has enough bandwidth,” he added. “So how do we partition the workload or schedule the neural network so that the external memory bandwidth is sustained, and is as low as possible? That’s basically something we do by doing packetization or breaking the neural network into smaller pieces and trying to execute both pieces.”
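To make that partitioning concrete, the sketch below is a hypothetical back-of-the-envelope calculation, not a description of Expedera's actual scheduler. It tiles a single batched fully connected layer over its output dimension so that each weight tile fits an assumed on-chip buffer, then estimates how much external memory traffic each legal tiling generates. The layer shape, batch size, buffer size, and int8 data type are all invented for illustration.

```python
# Hypothetical sketch (not any vendor's scheduler): tile a batched fully
# connected layer over its output dimension so each weight tile stays resident
# in an assumed on-chip buffer, then estimate the external memory traffic of
# each legal tiling. All sizes are invented for illustration.

BYTES = 1                               # assume int8 weights and activations
ON_CHIP_BUFFER = 2 * 1024 * 1024        # assumed 2 MB of on-chip SRAM
BATCH, IN_F, OUT_F = 512, 4096, 4096    # illustrative layer and batch shape

def fits_on_chip(out_tile):
    """Resident weight tile, one streamed input row, and one output tile."""
    return (IN_F * out_tile + IN_F + out_tile) * BYTES <= ON_CHIP_BUFFER

def external_traffic(out_tile):
    """Bytes moved to/from external memory: weights are read once in total,
    but the whole input batch is re-streamed for every weight tile."""
    tiles = -(-OUT_F // out_tile)                 # ceiling division
    weights = IN_F * OUT_F * BYTES
    inputs = BATCH * IN_F * tiles * BYTES         # re-read once per tile
    outputs = BATCH * OUT_F * BYTES
    return weights + inputs + outputs

for out_tile in (64, 128, 256, 512):
    if fits_on_chip(out_tile):
        mb = external_traffic(out_tile) / 1e6
        print(f"tile {out_tile:3d}: ~{mb:6.1f} MB of external traffic")
    else:
        print(f"tile {out_tile:3d}: weight tile does not fit the assumed buffer")
```

The larger tiles that still fit on chip re-read the input batch fewer times, which is the kind of bandwidth-versus-buffering tradeoff Chole describes.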
Designing for a rapidly changing future
One of the big problems with AI is that algorithms and compute models are evolving and changing faster than they can be designed from scratch.
“If you say you’re going to build a CPU that is really great at these LSTM (long short-term memory) models, that cycle is a couple of years,” said Rambus’ Woo. “Then you realize in two years, LSTM models came and went as the dominant thing. You want to do specialized hardware, but you have to do it faster to keep up. The holy grail would be if we could create hardware as fast as we could change algorithms. That would be great, but we can’t do that even though the industry is being pressured to do that.”
This also means the architecture of a processor handling AI workloads will be different from that of a processor not focused on AI workloads. “If you look at these engines for doing training, they’re not going to run Linux or Word, because they’re not designed for general-purpose branching, a wide range of instructions, or to support a wide range of languages,” Woo said. “They are pretty much bare-bones engines that are built to go very fast on a small number of types of operations. They’re highly tuned to the specific data movement patterns required to do the computations. In the Google TPU, for example, the systolic array architecture has been around since the 1980s. It’s very good at doing a particular type of very evenly distributed work over large arrays of data, so it’s perfect for these dense neural networks. But running general-purpose code is not what these things are designed to do. They’re more like massive co-processors that do the really big part of the computation really well, but they still need to interface to something else that can manage the rest of the computation.”
Even the benchmarking is difficult, because it’s not always an apples-to-apples comparison, and that makes it hard to develop the architecture. “This is a hard topic because different people use different tools to navigate this,” said Expedera’s Chole. “What this task looks like in the day-to-day of the design engineer is system-level benchmarking. Every part of the SoC you benchmark individually, and you’re trying to extrapolate based on those numbers what the bandwidth required is. ‘This is the performance, this is the latency I’m going to get.’ Based on that, you’re trying to estimate how the entire system would look. But as we actually make more headway during the design process, we look into some sort of simulation-based approach — not a full-blown simulation, but something like a transaction-accurate simulation — to get to the exact performance and exact bandwidth requirements for different design blocks. For example, there is a RISC-V and there is an NPU, and they have to work together and fully co-exist. Do they have to be pipelined? Can their workload be pipelined? How many exact cycles does the RISC-V require? For that, we have to compile the program on the RISC-V, compile our program on the NPU, and then co-simulate that.”
Impact of AI workloads on processor design
All of these variables impact the power, performance and area/cost of the design.
According to Ian Bratt, fellow and senior director of technology at Arm, “PPA tradeoffs for ML workloads are similar to the tradeoffs all architects face when looking at acceleration – energy efficiency versus area. Over the last several years, CPUs have gotten significantly better at ML workloads with the addition of ML-specific acceleration instructions. Many ML workloads will run admirably on a modern CPU. However, if you are in a highly constrained energy environment then it may be worth paying the additional silicon area cost to add dedicated NPUs, which are more energy efficient than a CPU for ML inference. This efficiency comes at the cost of additional silicon area and sacrificing flexibility; NPU IP can often only run neural networks. Additionally, a dedicated unit like an NPU may also be capable of achieving a higher overall performance (lower latency) than a more flexible component like a CPU.”
Russell Klein, program director for Siemens EDA’s Catapult Software Division, explained, “There are two major aspects of the design that will most significantly impact its operating characteristics, or PPA. One is the data representation used in the calculations. Floating point numbers are really quite inefficient for most machine learning calculations. Using a more appropriate representation can make the design faster, smaller, and lower power.”
The other major factor is the number of compute elements in the design. “Essentially, how many multipliers will be built into the design,” Klein said. “This brings parallelism, which is needed to deliver performance. A design can have a large number of multipliers, making it big, power hungry, and fast. Or it can have just a few, making it small and low power, but a lot slower. One additional metric, beyond power, performance, and area, that is very important is energy per inference. Anything that is battery powered, or that harvests energy, will likely be more sensitive to energy per inference than power.”
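A simple way to see why energy per inference can matter more than power is to compare two hypothetical accelerator configurations, one with many multipliers and one with few. The MAC counts, clock rates, and power figures in the sketch below are made up purely to show the arithmetic, not taken from any real design.

```python
# Illustrative comparison of two hypothetical accelerator configurations:
# one with many multipliers (fast, power hungry) and one with few (slow,
# frugal). Energy per inference, not just power, decides battery life.
# All numbers below are assumed for the sake of the arithmetic.

MACS_PER_INFERENCE = 500e6   # assumed workload: 500M multiply-accumulates

designs = {
    # name: (number of MAC units, clock in Hz, power in watts) -- all assumed
    "big_parallel": (4096, 1.0e9, 4.0),
    "small_frugal": (256,  1.0e9, 0.4),
}

for name, (macs, clock, power_w) in designs.items():
    macs_per_second = macs * clock
    latency_s = MACS_PER_INFERENCE / macs_per_second
    energy_mj = power_w * latency_s * 1e3   # millijoules per inference
    print(f"{name:13s}: latency {latency_s*1e3:6.2f} ms, power {power_w:4.1f} W, "
          f"energy/inference {energy_mj:5.2f} mJ")
```

In this toy case the big array burns 10x the power but finishes so much sooner that it spends less energy per inference, which is why the metric has to be evaluated explicitly rather than inferred from power alone.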
The numeric representation of features and weights also can have a significant impact on the PPA of the design.
“In the data center, everything is a 32-bit floating point number. Alternative representations can reduce the size of the operators and the amount of data that needs to be moved and stored,” he noted. “Most AI algorithms do not need the full range that floating point numbers support and work fine with fixed point numbers. Fixed point multipliers are usually about ½ the area and power of a corresponding floating point multiplier, and they run faster. Often, 32 bits of fixed point representation is not needed, either. Many algorithms can reduce the bit width of features and weights to 16 bits, or in some cases 8 bits or even smaller. The size and power of a multiplier are proportional to the square of the size of the data it operates on. So a 16-bit multiplier is ¼ the area and power of a 32-bit multiplier. An 8-bit fixed point multiplier consumes roughly 3% of the area and power of a 32-bit floating point multiplier. If the algorithm can use 8-bit fixed point numbers instead of 32-bit floating point, only ¼ the memory is needed to store the data and ¼ the bus bandwidth is needed to move the data. These are significant savings in area and power. By doing quantization-aware training, the required bit widths can be reduced even further. Typically, networks trained in a quantization-aware fashion need about ½ the bit width of a post-training quantized network. This reduces the storage and communication costs by ½ and the multiplier area and power by ¾. Quantization-aware trained networks typically require only 3-8 bits of fixed point representation. Occasionally, some layers can be just a single bit. And a 1-bit multiplier is an ‘and’ gate.”
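The scaling Klein describes can be approximated with a quick calculation: multiplier area and power grow roughly with the square of the operand width, while storage and bus bandwidth grow linearly. The sketch below normalizes everything to a 32-bit datapath; the quadratic model is a rough approximation, not a characterization of any particular cell library or process.

```python
# Rough sketch of the scaling described above: multiplier area and power grow
# roughly with the square of operand width; storage and bandwidth grow
# linearly. Moving from floating point to fixed point at the same width saves
# roughly another 2x, which is how an 8-bit fixed point multiplier ends up
# near ~3% of a 32-bit floating point one.

BASELINE_BITS = 32

for bits in (32, 16, 8, 4):
    mult_cost = (bits / BASELINE_BITS) ** 2   # ~relative multiplier area/power
    mem_cost = bits / BASELINE_BITS           # ~relative storage and bandwidth
    print(f"{bits:2d}-bit: multiplier ~{mult_cost:6.1%} of 32-bit, "
          f"memory/bandwidth ~{mem_cost:6.1%}")
```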
Also, when aggressively quantizing a network, overflow becomes a significant issue. “With 32-bit floating point numbers, developers don’t need to worry about values exceeding the capacity of the representation, but with small fixed point numbers this must be addressed. It is likely that overflow will occur frequently. Using saturating operators is one way to fix this. Instead of overflowing, the operation will store the largest possible value for the representation. It turns out this works very well for machine learning algorithms, because the exact magnitude of a large intermediate sum is not significant. The fact that it got large is sufficient. Using saturating math allows developers to shave an additional one or two bits off the size of the fixed point numbers they are using. Some neural networks do need the dynamic range offered by floating point representations. They simply lose too much accuracy when converted to fixed point, or require more than 32 bits of representation to deliver good accuracy. In this case there are several floating point representations that can be used. Bfloat16 (or ‘brain float’), developed by Google for its TPU, is a 16-bit float that is easily converted to and from traditional floating point. As with smaller fixed point numbers, it results in smaller multipliers and less data storage and movement. There is also an IEEE-754 16-bit floating point number, and NVIDIA’s TensorFloat,” Klein added.
Using any of these would result in a smaller, faster, lower power design.
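As a minimal illustration of the saturating arithmetic Klein mentions, the following sketch clamps a signed 8-bit accumulator instead of letting it wrap. The operand values and the 8-bit width are arbitrary; the point is only that a saturated result still signals "this got large," while a wrapped result is misleading.

```python
# Minimal sketch of saturating fixed point arithmetic, assuming a signed 8-bit
# accumulator purely for illustration. Instead of wrapping on overflow,
# results clamp to the largest/smallest representable value.

INT8_MAX, INT8_MIN = 127, -128

def saturate(value, lo=INT8_MIN, hi=INT8_MAX):
    """Clamp to the representable range instead of overflowing."""
    return max(lo, min(hi, value))

def saturating_mac(acc, a, b):
    """One multiply-accumulate step with saturation at each stage."""
    return saturate(acc + saturate(a * b))

def wrapping_mac(acc, a, b):
    """Wrapping (modular) arithmetic, as plain int8 hardware would do."""
    return ((acc + a * b + 128) % 256) - 128

acc_sat, acc_wrap = 0, 0
for a, b in [(12, 9), (11, 10), (7, 8)]:
    acc_sat = saturating_mac(acc_sat, a, b)
    acc_wrap = wrapping_mac(acc_wrap, a, b)

print("saturating accumulator:", acc_sat)   # pins at 127: "it got large"
print("wrapping accumulator:  ", acc_wrap)  # wraps to a misleading small value
```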
Additionally, Woo said, “If you have a general-purpose core, it’s really good at doing a lot of things, but it won’t do any of them great. It’s just general. At any point in time when you’re doing your workload, there are going to be parts of that general-purpose core that are in use, and parts that are not. It takes area, it takes power to have these things. What people began to realize was Moore’s Law is still giving us more transistors, so maybe the right thing to do is build these specialized cores that are good at certain tasks along the AI pipeline. At times you will turn them off, and at times you’ll turn them on. But that’s better than having these general-purpose cores where you’re always wasting some area and power, and you’re never getting the best performance. Along with a market that’s willing to pay — a very high-margin, high-dollar market — that is a great combination.”
It’s also a relatively well-understood approach in the hardware engineering world. “You bring up version 1, and once you’ve installed it, you find out what works, what doesn’t, and you try to fix the issues,” said Marc Swinnen, director of product marketing at Ansys. “The applications that you run are vital to understanding what these tradeoffs need to be. If you can make your hardware match the applications you want to run, you get a much more efficient design than using off-the-shelf stuff. The chip you make for yourself is ideally suited to exactly what you want to do.”
This is why some generative AI developers are exploring building their own silicon, which suggests that in their eyes even current semiconductors are not good enough for what they want to do going forward. It’s one more example of how AI is changing processor design and the surrounding market dynamics.
AI also likely will play heavily into the chiplet world, where semi-custom and custom hardware blocks can be characterized and added into designs without the need to create everything from scratch. Big chipmakers such as Intel and AMD have been doing this internally for some time, but fabless companies are at a disadvantage.
“The problem is that your chiplets have to compete against existing solutions,” said Andy Heinig, department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “And if you’re not currently focusing on performance, you can’t compete. People are focused on getting this ecosystem up and running. But from our perspective, it’s a bit of a chicken-and-egg problem. You need the performance, especially because the chips are more expensive than an SoC solution. But you can’t currently focus on performance because you have to get this ecosystem up and running first.”
The right start
Unlike in the past, when many chips were designed for a socket, with AI it’s all about the workload.
“It’s very important when these tradeoffs are happening to have a notion of what the goal is in mind,” said Expedera’s Chole. “If you just say, ‘I want to do everything and support everything,’ then you’re not really optimizing for anything. You’re basically just putting a general-purpose solution inside there and hoping it will meet your power requirement. That, in our understanding, has rarely worked. Every neural network and every deployment case on edge devices is unique. If your chip is going into a headset and running an RNN, versus sitting in an ADAS chip and running transformers, it’s a completely different use case. The NPUs, the memory systems, the configuration, and the power consumption are totally different. So it is very important that we understand which set of workloads we want to target. These can be multiple networks. You must get to the point that the team agrees on the networks that are important, and optimizes based on those. That’s missing when engineering teams are thinking about NPUs. They’re just thinking that they want to get the best in the world, but you cannot have the best without trading off something. I can give you the best, but in what area do you want the best?”
Cadence’s Mitra noted that everybody thinks about PPA in a similar way, but people emphasize different parts of the power, performance, area/cost (PPAC) equation. “If you’re a data center guy, you may be okay with sacrificing a little bit of area, because what you’re gunning for is very high-throughput machines. You need to do billions of AI inferences at one shot, running humongous models that lead to humongous amounts of data. Long gone are the days when you could think about a desktop running AI model development work. Even the inferencing for some of these large language models is getting pretty tricky. It means you need a massive data cluster and massive compute at data center scale at the hyperscalers.”
There are other considerations as well. “Hardware architectural decisions drive this, but the role of software is also critical,” said William Ruby, product management director for Synopsys’ EDA Group, noting that performance versus energy efficiency is key. “How much memory is needed? How would the memory subsystem be partitioned? Can the software code be optimized for energy efficiency? (Yes, it can.) Choice of process technology is also important – for all the PPAC reasons.”
Further, if power efficiency is not a priority, an embedded GPU can be used, according to Gordon Cooper, product manager for AI/ML processors at Synopsys. “It will give you the best flexibility in coding, but will never be as power- and area-efficient as specialized processors. If you are designing with an NPU, then there are still tradeoffs to make in terms of balancing area versus power. Minimizing on-chip memory should significantly decrease your total area budget, but will increase data transfers from external memory, which significantly increases power. Increasing on-chip memory will decrease power from external memory reads and writes.”
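The tradeoff Cooper describes can be roughed out with assumed per-byte access energies, since external DRAM typically costs far more energy per byte than on-chip SRAM. The energies, traffic volume, and on-chip hit fractions in the sketch below are illustrative round numbers, not vendor data.

```python
# Back-of-the-envelope sketch of the on-chip vs. external memory tradeoff:
# shrinking on-chip SRAM saves area but pushes traffic to external DRAM,
# which costs far more energy per byte. All figures below are assumed round
# numbers for illustration only.

SRAM_PJ_PER_BYTE = 5          # assumed on-chip access energy
DRAM_PJ_PER_BYTE = 100        # assumed external DRAM access energy
BYTES_PER_INFERENCE = 20e6    # assumed total tensor traffic per inference

# Assumed fraction of traffic that stays on-chip for each SRAM size.
configs = {
    "small SRAM (1 MB)": 0.50,
    "medium SRAM (4 MB)": 0.80,
    "large SRAM (8 MB)": 0.95,
}

for name, on_chip_fraction in configs.items():
    on_chip = BYTES_PER_INFERENCE * on_chip_fraction
    off_chip = BYTES_PER_INFERENCE - on_chip
    energy_uj = (on_chip * SRAM_PJ_PER_BYTE + off_chip * DRAM_PJ_PER_BYTE) / 1e6
    print(f"{name:18s}: ~{energy_uj:6.1f} uJ of memory energy per inference")
```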
Conclusion
All of these issues increasingly are becoming systems problems, not just chip problems.
“People look at the AI training part as, ‘Oh wow, that’s really computationally heavy. It’s a lot of data movement,'” said Woo. “And once you want to throw all this acceleration hardware at it, then the rest of the system starts to get in the way. For this reason, increasingly we’re seeing these platforms from companies like NVIDIA and others, which have elaborate AI training engines, but they also may have Intel Xeon chips in them. That’s because there is this other part of the computation that the AI engines just are not well suited to do. They’re not designed to run general-purpose code, so more and more this is a heterogeneous system issue. You’ve got to get everything to work together.”
The other piece of the puzzle is on the software side, which can be made more efficient through a variety of methods such as reduction. “This is the realization that within AI, there’s a specific part of the algorithm and a specific computation called a reduction, which is a fancy way of taking a lot of numbers and reducing it down to one number or small set of numbers,” Woo explained. “It could be adding them all together or something like that. The conventional way to do this is if you’ve got all this data coming from all these other processors, send it through the interconnection network to one processor, and have that one processor just add everything. All these numbers are going through this network through switches to get to this processor. So why don’t we just add them in the switch, because they’re all going by the switch? The advantage is it’s similar to in-line processing. What’s fascinating is that once you’re done adding everything in the switch, you only need to deliver one number, which means the amount of network traffic goes down.”
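A toy model helps show why in-switch reduction cuts traffic. In the sketch below, which is only an illustration and not any vendor's actual fabric, a naive reduction sends every worker's value across every level of an assumed switch tree to a single root processor, while an in-switch reduction forwards just one partial sum per link. The worker count and switch fan-out are arbitrary assumptions.

```python
# Toy model of in-network (in-switch) reduction: if each switch adds the
# partial sums passing through it, only one value per link moves up the tree,
# instead of every worker's value traveling all the way to a single root.
# Tree shape and counts are assumptions for illustration only.

import math

def naive_reduce_traffic(workers, fanout):
    """Link traversals if every worker sends its value to the root processor."""
    depth = math.ceil(math.log(workers, fanout))
    return workers * depth          # each value crosses every level of the tree

def in_switch_reduce_traffic(workers, fanout):
    """Link traversals if each switch sums its children and forwards one value."""
    traffic, nodes = 0, workers
    while nodes > 1:                # one message per node, per level
        traffic += nodes
        nodes = math.ceil(nodes / fanout)
    return traffic

workers, fanout = 1024, 16
print("naive reduction:    ", naive_reduce_traffic(workers, fanout), "messages")
print("in-switch reduction:", in_switch_reduce_traffic(workers, fanout), "messages")
```

Beyond the raw message count, the in-switch version also removes the hot spot at the root, which is the parallelism benefit Woo points to next.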
Architecture considerations like this are worth exploring because they tackle several issues at once, said Woo. First, the movement of data across networks is incredibly slow, which tells you to move the least amount of data possible. Second, it eliminates the redundant work of delivering the data to a processor only to have that processor do all the math and then deliver the result back. It all gets done in the network. And third, it’s very parallel, so each switch can do part of the computation.
Likewise, Expedera’s Chole said AI workloads now can be defined by a single graph. “Having that graph is not for a small set of instructions. We are not doing one addition. We are doing millions of additions at once, or we are doing 10 million matrix multiplication operations at once. That changes the paradigm of what you are thinking about execution, how you’re thinking about instructions, how you can compress the instructions, how you can predict and schedule the instructions. Doing this in a general-purpose CPU is not practical. There is too much cost to be able to do this. However, as a neural network, where the number of MACs that are active simultaneously is huge, the way you can generate the instructions, create the instructions, compress the instructions, schedule the instructions, changes a lot in terms of the utilization and bandwidth. That has been the big impact of AI on the processor architecture side.”
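Chole's point about treating the whole workload as a single graph can be illustrated with a toy example: a few hypothetical layers expressed as graph nodes, with the MAC count of each node computed up front so a compiler or scheduler could plan execution as one unit. The layer shapes below are invented for illustration.

```python
# Minimal sketch of the "whole workload as one graph" view: a tiny,
# hypothetical network expressed as a list of layer nodes, with the MAC count
# of each node computed ahead of time so a scheduler could plan execution of
# the entire graph at once. Layer shapes are invented.

graph = [
    # (name, kind, params): conv params = (in_ch, out_ch, kernel, out_h, out_w)
    #                       fc params   = (in_features, out_features)
    ("conv1", "conv", (3, 32, 3, 112, 112)),
    ("conv2", "conv", (32, 64, 3, 56, 56)),
    ("fc1",   "fc",   (64 * 56 * 56, 1000)),
]

def macs(kind, params):
    """Multiply-accumulate count for one node of the graph."""
    if kind == "conv":
        in_ch, out_ch, k, out_h, out_w = params
        return in_ch * out_ch * k * k * out_h * out_w
    if kind == "fc":
        in_f, out_f = params
        return in_f * out_f
    raise ValueError(f"unknown node kind: {kind}")

total = 0
for name, kind, params in graph:
    m = macs(kind, params)
    total += m
    print(f"{name:6s}: {m/1e6:8.1f} M MACs")
print(f"total : {total/1e6:8.1f} M MACs scheduled as one graph")
```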
Related Reading
Partitioning Processors For AI Workloads
General-purpose processing, and lack of flexibility, are far from ideal for AI/ML workloads.
Processor Tradeoffs For AI Workloads
Gaps are widening between technology advances and demands, and closing them is becoming more difficult.
Specialization Vs. Generalization In Processors
What will it take to achieve mass customization at the edge, with high performance and low power?