Mass Customization For AI Inference

The number of approaches for processing AI inference is growing to handle unique applications and larger, more complex models.


Rising complexity in AI models and an explosion in the number and variety of networks are leaving chipmakers torn between fixed-function acceleration and more programmable accelerators, and spawning novel approaches that include elements of both.

By all accounts, a general-purpose approach to AI processing is not making the grade. General-purpose processors are exactly that. They’re not designed or optimized for any specific workload. And because AI consumes a significant portion of system power, focusing on specific use cases or workloads can provide big power savings and better performance in a smaller footprint.

“Over the last decade, AI has had a far-reaching impact on both computing and the semiconductor industry — so much so that now specialized processor architectures have been adopted, and specialized components that serve just the AI market have also been developed and adopted,” said Steven Woo, fellow and distinguished inventor at Rambus.

But that specialization comes at a cost. “With ML and AI, the compute demand is insatiable,” said Ian Bratt, Arm fellow and vice president of machine learning technology. “If you could do 10X more compute, people would use it because when you run a 10X bigger model, you can do a better job. Because that demand is insatiable, that pushes you toward optimizing for that workload, and there are different types of NPUs that have been built that can achieve very good energy efficiency on specific classes of neural network models, and you get fantastic ops per watt and performance in those spaces. However, that comes at the cost of flexibility, because nobody knows where the models are going. So it trades off the future-proofing aspect.”

As a result, some engineering teams are looking at different approaches to optimization. “General-purpose compute platforms like CPUs and GPUs have been adding more internal acceleration for neural networks, without sacrificing the general programmability of those platforms, like CPUs,” Bratt said. Arm has a roadmap of CPU instructions, and has been adding to the architecture and the CPUs for several years in order to improve ML performance. “While this is still on the general platform, you can get a lot of the way there. It’s not as good as that dedicated NPU, but it’s a more flexible and future-proof platform,” he said.

Improving efficiency is essential, and it affects everything from the amount of energy required to train an AI model in a hyperscale data center to the battery life of an edge device doing inferencing.

“If you take a classic neural network, where you have layers of nodes and the information goes from node to node, the essential difference between training and execution is that during training, you have back-propagation,” said Marc Swinnen, director of product marketing at Ansys. “You take your data set and you run it through the nodes. Then you calculate your error function, meaning how wrong the answer is compared to the tagged results you know you need to achieve. Then you take that error and back-propagate it, and adjust all the weights on the nodes and on the connections between them to reduce the error. Then you sweep through again with some more data, and then back-propagate the error again. You go back and forth, and back and forth, and that is training. With each sweep you improve the weights, and eventually you hope to converge on a set of trillions of weights, biases, and values on the nodes that will give reliable outputs. Once you have those weights and all those parameters for every node, and you execute the actual AI algorithm, then you don’t do back-propagation. You don’t correct it anymore. You just put in the data and feed it through. It’s a much simpler, uni-directional processing of data.”

That back-propagation requires a lot of energy for all the calculations. “You have to average over all the nodes and all the data to form the error function, and then it has to be weighted and divided, and so on,” Swinnen explained. “There’s all the math of the back propagation — that doesn’t happen in actual, real execution [during inference]. That’s one of the big differences. There’s a lot less math that needs to be done in inference.”
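To make the difference concrete, here is a minimal NumPy sketch of a hypothetical two-layer network (the shapes, learning rate, and loss are illustrative, not drawn from any model discussed here). Training runs the forward sweep, measures the error, and back-propagates to adjust the weights; inference is the forward sweep alone.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))  # toy weights

def forward(x):
    h = np.maximum(0, x @ W1)      # hidden layer with ReLU
    return h @ W2, h               # prediction plus activations

def train_step(x, y, lr=0.01):
    """One training sweep: forward pass, error, back-propagation, update."""
    global W1, W2
    pred, h = forward(x)
    err = pred - y                 # how wrong vs. the tagged results
    grad_W2 = h.T @ err            # extra matrix math only training needs
    grad_h = (err @ W2.T) * (h > 0)
    grad_W1 = x.T @ grad_h
    W2 -= lr * grad_W2             # adjust the weights to reduce the error
    W1 -= lr * grad_W1

def infer(x):
    """Inference: one uni-directional sweep, no gradients, no updates."""
    return forward(x)[0]

x, y = rng.normal(size=(16, 4)), rng.normal(size=(16, 1))
train_step(x, y)                   # forward + backward + weight update
print(infer(x).shape)              # forward only
```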

That still leaves a lot of processing, however, and the trend line only points up and to the right as AI algorithms become more complex and the number of floating-point operations increases.

“The number of floating-point operations performed by the winning ImageNet ‘Top1’ algorithm over the past five years has increased by a factor of 100,” said Russ Klein, program director in the High-Level Synthesis Division at Siemens Digital Industries Software. “Of course, LLMs are setting new records for model parameters. As the compute load goes up, it becomes less practical to run these models on a general-purpose CPU. AI algorithms are generally highly data-parallel, meaning it is possible for operations to be spread across multiple CPUs. That means performance targets can be met by just applying more CPUs to the problem. But the energy needed to perform these computations on a CPU can be prohibitive. GPUs and TPUs, which generally have higher power consumption, do the computations faster, resulting in lower energy consumption for the same operations.”
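Klein’s point about data parallelism can be seen in a small sketch, assuming a simple matrix-multiply workload and a four-worker pool (both are illustrative choices). The same operation is applied to independent slices of a batch on separate CPU processes, which raises throughput but not energy efficiency.

```python
import numpy as np
from multiprocessing import Pool

W = np.random.default_rng(1).normal(size=(256, 256))  # shared toy weights

def infer_chunk(x_chunk):
    # The same operation applied to an independent slice of the batch.
    return np.maximum(0, x_chunk @ W)

if __name__ == "__main__":
    batch = np.random.default_rng(2).normal(size=(1024, 256))
    chunks = np.array_split(batch, 4)        # one slice per worker
    with Pool(processes=4) as pool:          # spread work across CPUs
        outputs = pool.map(infer_chunk, chunks)
    print(np.vstack(outputs).shape)          # reassembled batch
```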

Nevertheless, demand for more processing continues to rise. Gordon Cooper, product manager in the Solutions Group at Synopsys, pointed to a sharp uptick in the number of benchmark requests for generative AI inference, indicating increasing interest. “More than 50% of our recent benchmark requests had at least one generative AI model on the list,” he said. “What’s harder to assess is whether there is a specific use case they have in mind, or whether they are hedging their bets and saying, ‘This is the trend. I must be able to tell people I have this.’ I see the need to claim the capability is still ahead of the use cases.”

At the same time, the pace of change in these models continues to increase. “We’re still a long way from hard-wired AI, meaning an ASIC, such that, ‘Here it is. The standard is set. These are the benchmarks, and this will be the most efficient,’” Cooper said. “Therefore, programmability remains critical, because you must be able to have some level of programmability for the next thing that comes along to make sure that you have some flexibility. However, if you’re too programmable, then you’re just a general-purpose CPU or even a GPU, and then you’ve not taken advantage of the power and area efficiency of an edge device. The challenge is how to be as optimized as possible, and yet still be programmable for the future. That’s where we, and some of our competitors, are trying to hover in that space of being flexible enough. An example would be activation functions, such as ReLU (rectified linear unit). We used to hard-wire them in, and now we see that’s ridiculous because we can’t guess what they’re going to need next time. So now we have a programmable lookup table to support anyone in the future. It took us a few generations to get to the point where we could see that we’ve got to start making it more flexible.”
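The lookup-table idea Cooper describes can be sketched in a few lines. This is an illustration, not Synopsys’ implementation: the table size and the GELU example are assumptions, but they show how one programmable table can be re-loaded for whatever activation function the next model needs.

```python
import numpy as np

def build_lut(fn, lo=-8.0, hi=8.0, entries=256):
    """Sample an arbitrary activation function into a fixed-size table."""
    xs = np.linspace(lo, hi, entries)
    return xs, fn(xs)

def lut_activate(x, xs, ys):
    """Approximate the function at run time by interpolating the table."""
    return np.interp(np.clip(x, xs[0], xs[-1]), xs, ys)

relu = lambda v: np.maximum(0, v)
gelu = lambda v: 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))

# The same hardware table can be re-programmed for future functions.
xs, ys = build_lut(gelu)                    # swap in relu, gelu, etc.
print(np.round(lut_activate(np.linspace(-4, 4, 9), xs, ys), 3))
```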

AI processing evolves
AI’s rapid evolution is enabled by big advances in compute performance and capacity. “We’re now at AI 2.0,” said Rambus’ Woo. “AI 1.0 was really characterized by the first forays into using AI throughout computing. Things like voice assistants and recommendation engines started to gain traction because they were able to deliver higher quality results using AI. But as we look back, in some ways, they were limited. There were certain types of inputs and outputs that the systems could use, and they weren’t really generating the kind of quality of information that they’re able to generate today. Building on AI 1.0 is where we are today. AI 2.0 is characterized by systems that now create something new from the data that they’ve learned on, and from the inputs they’re getting.”

Chief among these techniques are large language models and generative AI, with co-pilots and digital assistants that help humans to be more productive. “These systems are characterized by multi-modal types of inputs and outputs,” Woo explained. “They can take many things as input, text, video, speech, even code, and they can produce something new from it. In fact, multiple types of media can be produced from it, as well. All of this is another step toward the greater goal of artificial general intelligence (AGI), where we’re looking as an industry to try and provide more human-like behavior that builds upon the foundations that AI 1.0 and AI 2.0 have been able to set for us. The idea here is to be able to really adapt to our environment and to customize the results for specific users and specific use cases. There’ll be improvements in the way content is generated, particularly in the case of things like video, and even in the future, using AGI as a way to guide autonomous agents like robot assistants that can both learn and adapt.”

In this journey, the sizes of AI models have been growing dramatically — about 10X or more per year. “Today, the largest models that are available in 2024 already have crossed the trillion-parameter mark,” he said. “This is happening because larger models provide more accuracy, and we’re still in the early days of getting models to a point where they can be very productive. And of course, it’s still a stepping stone on the way to AGI.”
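Some rough arithmetic shows why memory capacity follows parameter count (this counts only weight storage and ignores activations, KV caches, and optimizer state, which add substantially more):

```python
params = 1_000_000_000_000  # one trillion parameters
for fmt, bytes_per_param in [("FP32", 4), ("FP16/bfloat16", 2), ("INT8", 1)]:
    print(f"{fmt:14s}: {params * bytes_per_param / 1e12:.0f} TB just to hold the weights")
# FP32: 4 TB, FP16/bfloat16: 2 TB, INT8: 1 TB
```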

Three or four years ago, before vision transformers and LLMs, it was common to see SoC requirements specifications for new NPU functionality limited to a small set of well-known, well-optimized detectors and image classifiers such as Resnet50, ImageNet v2, and a legacy VGG16. “More often than not, semiconductor companies would evaluate third-party IP for those networks, yet ultimately decide to build their own accelerator for the common building block graph operators found in those baseline benchmark networks,” said Steve Roddy, chief marketing officer at Quadric. “In truth, the vast majority of AI acceleration in volume SoCs are home-grown accelerators. Tear-downs of all the leading mobile phone SoCs of 2024 will prove the point that all of the top-six volume mobile SoCs utilize in-house NPUs.”

Many of those will likely be replaced by, or supplemented with, commercial NPU designs with more flexibility. “Requests for proposals for new NPU IP routinely include 20, 30, or more networks spanning a range of classic CNNs, such as Resnet, ResNext, and others, new complex CNNs (i.e., ConvNext), vision transformers like the SWIN transformer and Deformable Transformers, along with GenAI LLMs/SLM, of which there are too many model variants to count,” Roddy said. “It is simply infeasible to build hard-wired logic to accelerate such a wide variety of networks comprised of hundreds of different variants of AI graph operators. SoC architects as a result are searching for more fully programmable solutions, and most internal teams are looking to outside third-party IP vendors that can provide the more robust compiler toolsets needed to rapidly compile new networks, rather than the previous labor-intensive method of hand-porting ML graphs.”

History repeats
This evolution in AI is akin to what happened in compute over time. “First, computers were in the data centers, then that compute started to proliferate out,” said Jason Lawley, director of product marketing for the Neo NPU at Cadence. “We moved to desktops that went to people’s homes and expanded out. Then we got laptops, followed by mobile phones. AI is the same way. We can look at the computational intensity that’s required to do AI starting in the data centers. We see that now with NVIDIA. Having said that, there’s always going to be a place for mainframes and data centers. But what we’re going to see is a proliferation of AI as it moves out from the data center to the edge. As you move to the edge, you get a huge variety of different types of applications that are required. Cadence focuses on video, audio, and radar, with other compute classes that sit around those, and each of those pillars is an accelerator to the application processor. Within each of those pillars, they might need to do more AI, so the AI NPU becomes an accelerator to the accelerator.”

Customer behavior is evolving, as well. “Increasingly, systems companies and end users have their own proprietary models, or models retrained with proprietary data sets,” Roddy said. “Those OEMs and downstream users cannot or will not release proprietary models to a silicon vendor to have the silicon vendor’s porting team get a new model working. And even if you can work out the NDA protections up and down the supply chain, a working model dependent upon manual labor to tune and port ML models cannot scale widely enough to support the entire consumer electronics and industrial electronics ecosystem. The new working model is a fully programmable, compiler-based toolchain that can reside in the hands of the data scientist or software developer creating the end application, which is exactly how toolchains for leading CPUs, DSPs and GPUs have been deployed for decades.”

Rising algorithm complexity puts added pressure on engineering teams
As algorithms grow in complexity, designers are being pushed toward higher levels of acceleration. “The more highly tailored to a specific model the accelerator is, the faster and more efficient it can be, but the less general it will be,” said Siemens’ Klein. “And it will be less resilient to application and requirement changes.”

Fig. 1: The relationship between power and performance for different execution platforms running AI models: CPUs, GPUs, TPUs, and custom accelerators. Source: Siemens Digital Industries Software

Fig. 2: Inferencing is increasing in complexity. Source: Siemens Digital Industries Software

Rambus’ Woo also sees a trend toward larger AI models, because they provide higher-quality, more capable, and more accurate results. “It’s showing no signs of slowing down, and we expect into the future that we’re going to continue to see a much higher demand for more DRAM capacity and more DRAM bandwidth. We’re all familiar with the AI training engines as being the showcase part of AI, at least from the hardware side. Compute engines from companies like NVIDIA and AMD, and specialized engines that are produced by companies like Google with its TPU, are great advances in terms of the ability of the industry to compute on and provide much better AI. But those engines must be fed with a lot of data, and data movement is one of the key limiters these days for how quickly we can train our models. If those high-performance engines are waiting for data, then they’re not doing their job. We must make sure that the whole pipeline is designed to provide data in a fashion that we can keep those engines running.

“If we look left to right, what is often the case is large amounts of data are stored, sometimes in a very unstructured way, and so they’ll be on things like SSDs or hard disk drives, and those systems are tasked with pulling the most relevant and important data out to train whatever model it is we’re training and get it into a form that the engines can use. Those storage systems have quite a lot of regular memory in them, as well, for things like buffers and all that. So just as an example, some of those storage systems can have up to a terabyte of memory capacity. Once the data is pulled out of the storage, it’s sent to a set of servers to do data preparation. Some people call this the reader tier. And the idea here is to take that unstructured data and to then prepare it so that it can be used in a way that the AI engines can best train on.”

At the same time, alternative numeric representations can further improve PPA. “Floating point numbers, which typically are used for AI training and inferencing in Python ML frameworks, are not an ideal format for these calculations,” Klein explained. “The numbers in AI calculations are predominantly between -1.0 and 1.0. Data is often normalized to this range. While 32-bit floating-point numbers can span from roughly -10³⁸ to 10³⁸, this leaves a lot of unused space in both the numbers and the operators performing computations on those numbers. The hardware for the operators and the memory storing the values take up silicon area and consumes power.”

Google created a 16-bit floating point number format called brain float (bfloat), which targets AI calculations. With half the storage area for model parameters and intermediate results, there is a big improvement in PPA. Vectorized (SIMD) bfloat instructions are now an optional instruction set extension for RISC-V processors. Some algorithms are deployed using integer or fixed-point representation.  Moving from a 32-bit float to an 8-bit integer requires one-quarter of the memory area. Data is moved around the design four times faster, and the multipliers are 97% smaller. The smaller multipliers allow for more operators in the same silicon area and power budget, enabling greater parallelism. “Posits” are another exotic representation that works well on AI algorithms.
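A minimal sketch of those two representation changes, assuming NumPy on a host CPU rather than any particular accelerator’s conversion hardware: bfloat16 keeps float32’s exponent range but truncates the mantissa (emulated here by zeroing the low 16 bits), while int8 quantization maps a normalized [-1, 1] range onto signed 8-bit integers with a scale factor.

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by dropping the low 16 mantissa bits of float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)  # same exponent range, less precision

def quantize_int8(x, scale=127.0):
    """Map values normalized to [-1, 1] onto signed 8-bit integers."""
    return np.clip(np.round(np.asarray(x) * scale), -128, 127).astype(np.int8)

def dequantize_int8(q, scale=127.0):
    return q.astype(np.float32) / scale

x = np.float32([0.1, -0.73, 0.999, -0.0001])
print(to_bfloat16(x))                 # half the storage of float32
q = quantize_int8(x)
print(q, dequantize_int8(q))          # one quarter the storage of float32
```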

“A general-purpose AI accelerator, like those produced by NVIDIA and Google, must support 32-bit floating-point numbers, as some AI algorithms will require them,” Klein said. “In addition, they can add support for various sizes of integers, and possibly brain-floats or posits. But supporting each new numeric representation requires operators for that representation, which means more silicon area and power, hurting PPA. Some Google TPUs support 8- and 16-bit integer formats, in addition to 32-bit floating-point numbers. But if an application has an optimal sizing of 11-bit features and 7-bit weights, it doesn’t quite fit. The 16-bit integer operators would need to be used. But a bespoke accelerator with an 11 x 7 integer multiplier would use about 3.5 times less area and energy. For some applications that would be a compelling reason to consider a bespoke accelerator.”
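Klein’s 11-bit feature and 7-bit weight example can be modeled in software before committing to silicon. This sketch simply clamps values to arbitrary signed widths (the scale factors are illustrative); a high-level synthesis flow would then size the multiplier array to exactly those widths.

```python
import numpy as np

def quantize_to_bits(x, bits, scale):
    """Clamp scaled values to a signed integer range of arbitrary width."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int32)

rng = np.random.default_rng(3)
features = quantize_to_bits(rng.normal(size=(8, 16)), bits=11, scale=512)
weights  = quantize_to_bits(rng.normal(size=(16, 4)), bits=7,  scale=32)

# A bespoke 11x7 multiplier array would execute this product natively;
# a general-purpose accelerator would pad the operands out to 16 bits.
acc = features @ weights   # products accumulate in NumPy's native integer width
print(acc.shape, acc.dtype)
```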

With all roads leading to customization, there are a number of considerations that chip designers need to know about customized AI engines.

“When you license something that is highly customized, or customized to any degree, what you’re going to get is something that’s going to be different,” said Paul Karazuba, vice president of marketing at Expedera. “It’s going to be something that’s not a standard. So there’s going to be a bit of a learning curve on that. You’re going to get something that is, let’s say, boutique, and with that, there are going to be hooks inside that are going to be unique for you as a chip designer. That means there’s a learning curve for you as a chip designer, as an architect, for understanding exactly how that’s going to play in your system. Now, there are advantages for that. If standard IP, like a PCIe or a USB, has stuff inside of it that you don’t want or you don’t need, maybe there are hooks inside of it that don’t play well with the architecture that you as a chip designer have chosen.”

This is essentially margin in the design, and it impacts performance and power. “When you get a customized AI engine, you can make sure those hooks that you don’t like don’t exist,” Karazuba said. “You can make sure that IP plays well inside of your system. So there are certainly advantages to coming to something like that. There are disadvantages, as well. You don’t get the scale that you get from a standard IP. But with something that’s highly customized, you are going to have exactly that. You’re going to have something that’s customized, that’s going to have some advantages for your system, but you’re going to deal with longer lead times. You’re probably going to deal with something that is unique to you. There are going to be some intricacies.”

However, the benefits can outweigh the learning curve. In an early customer example, Karazuba recalled, “They had developed their own internal AI network that was designed to reduce noise in a 4k video stream. They wanted to do 4k video rate. This is a network they developed internally. They had spent millions of dollars to build this out. They initially intended to use the NPU that existed on their application processor, and that is, as you would suspect, a general-purpose NPU. They put their algorithm on that NPU, and they got two frames per second, which is obviously not video rate. They came to us, and we licensed them a targeted, customized version of our IP. They had a chip built for them containing our IP, running the exact same network, and it got them 40 frames per second, so a 20X increase in performance by building a focused engine. The other benefit is they were able to run it at half the power of what the NPU on their application processor was consuming because it was focused. So 20X the throughput at less than half the power. And to be fair, it’s the same process node as the application processor, so it really was an apples-to-apples comparison. Those are the type of benefits that you see from doing something like this. Now, obviously there’s the cost aspect of what you put on there. It’s going to be more expensive to build your own chip than it is going to be to use something that’s present on a chip that you’re already buying. But if you can differentiate your product with this AI, and you can get this level of performance, that additional cost may not be a barrier.”

Conclusion
Looking ahead, Arm’s Bratt says there’s more than enough AI/ML to go around. “What we will see is that for cases where people really care about the energy efficiency and the workloads are slower-moving, like deeply embedded environments, you’ll see these dedicated NPUs with highly optimized models that are targeted towards those NPUs, and you’ll get great performance. But in general, the programmable platforms like the CPUs will keep moving forward. They’ll keep getting better at ML, and they’ll be running those new workloads that come fresh out of the box. Maybe you can’t map them to the existing NPUs, because they’ve got new operators or new data types. But as things stabilize, and for certain verticals, you’ll take those models that are running on programmable platforms, and you’ll optimize them for NPUs, and you’ll get the best performance in that embedded vertical, like a surveillance camera or other application. These two modes will co-exist going forward for quite a while.”

What chip architects and design engineers need to know about the changes that are coming from AI processing boils down to three things — storing, moving, and computing data, Cadence’s Lawley said. “Fundamentally, those three things haven’t changed since the beginning of Moore’s Law, but the biggest thing that they must be aware of is this trend toward low power and optimal data use, as well as advances around quantization — the ability to pin memory into a system and efficiently re-use it. So in the movement of data, the storage of data, and then the computation of data, what kind of layer fusion should be used? Software often plays just as important a role in this as hardware, so the ability for the algorithms not to compute things that don’t need to be computed, and not to move things that don’t need to be moved — that’s where a lot of our focus is. How do we get this maximum performance with the minimum energy? It’s a hard problem to solve.”
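One common software technique behind Lawley’s point about not moving data unnecessarily is layer fusion. The sketch below is illustrative rather than Cadence’s approach: folding batch normalization into the preceding layer’s weights offline means the intermediate tensor is produced and consumed in a single pass instead of being written out and read back.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linear layer (standing in for a 1x1 convolution) followed by batch norm.
W = rng.normal(size=(16, 32)).astype(np.float32)
b = rng.normal(size=(32,)).astype(np.float32)
gamma, beta = rng.normal(size=(32,)), rng.normal(size=(32,))
mean, var, eps = rng.normal(size=(32,)), rng.uniform(0.5, 2.0, size=(32,)), 1e-5

def unfused(x):
    y = x @ W + b                                            # intermediate written out
    return gamma * (y - mean) / np.sqrt(var + eps) + beta    # read back for batch norm

# Fold batch norm into the layer's weights once, offline.
s = gamma / np.sqrt(var + eps)
W_fused = W * s                    # scale each output channel
b_fused = (b - mean) * s + beta

def fused(x):
    return x @ W_fused + b_fused   # one layer, one pass over the data

x = rng.normal(size=(8, 16)).astype(np.float32)
assert np.allclose(unfused(x), fused(x), atol=1e-4)
```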

Related Reading
New AI Processor Architectures Balance Speed With Efficiency
Hot Chips 24: Large language models ratchet up pressure for sustainable computing and heterogeneous integration; data management becomes key differentiator.
Higher Density, More Data Create New Bottlenecks In AI Chips
More options are available, but each comes with tradeoffs and adds to complexity.


