AI Accelerator Architectures Poised For Big Changes

Design teams are racing to boost speed and energy efficiency of AI as it begins shifting toward the edge.


AI is driving a frenzy of activity in the chip world as companies across the semiconductor ecosystem race to include AI in their product lineups. The challenge now is how to make AI run faster, use less energy, and work everywhere from the edge to the data center — particularly with the rollout of large language models.

On the hardware side, there are two main approaches for accelerating AI. One involves discrete co-processors, which can be added into some sort of advanced package as separate chips or chiplets. The other involves customized accelerators that are embedded in an SoC as soft IP. Both approaches work for improving performance by targeting specific types of data, particularly when combined with shorter and fatter signal paths. But there are significant tradeoffs for each.

Cramming everything onto a single chip is generally simpler and cheaper with homogeneous compute elements, such as redundant arrays of GPUs. But the dynamic power density and thermal concentration are higher, and the energy efficiency of these general-purpose devices is lower because they are not optimized for different data types. Adding customized accelerators into those architectures removes some of the cost benefits, but it creates new challenges that must be addressed, particularly those involving parasitics that are complex and often unique to each design.

These issues are generally easier to manage when different processing elements and memories are integrated and assembled inside an advanced package. The downside is the distances are typically longer than if everything is packed onto a single chip, and the cost is higher — at least for now. Simulation, inspection, metrology, and test are more complicated and time-consuming, and much of this is highly customized today. While some of these differences are expected to be ironed out over time, particularly with the advent of 3D-ICs and commercially available chiplets, it still may be years before there is parity between these approaches.

That also will help pull AI out of the data center and into edge devices, where it can be used for everything from data pre-processing to inferencing closer to the source of the data.

“With all the activity in AI in the past years, a lot of the focus has been on training because of the excitement about large language models,” said Tony Chan Carusone, CTO at Alphawave Semi. “We see a lot of news about how many millions of dollars it costs to train one of these things, and the weeks and months to do so. These are massive exercises, and training is done on a battery of GPUs, or TPUs, or whatever they’ve got that are dedicated to doing those ML computations as fast as possible. It’s not so much of an architecture where you’ve got a CPU and a GPU paired up everywhere. It’s more like a battery of hundreds or thousands of GPUs networked together to each other, and to memory, with as much bandwidth as possible.”

Today, the same AI accelerator typically is used for both training and inference, but this is likely to change going forward, particularly with the increased focus on large language models.

“If you’re trying to train something, like AI on video or self-driving cars, these are massive training exercises,” said Chan Carusone. “You’ve got who knows how many exabytes of video to do the training, and the custom silicon designs now being tailored to this kind of workload are focusing mostly on training at first. But over time, the inference is going to be the bigger workload. This is because every time you train a model that big, you’re going to want to use it for lots of inference to get the ROI. We’ll see that sort of segue to inference, where there’s a second wave of development on hardware for inference. There will be bigger volume there in silicon.”

But optimizing AI accelerators can be very different than other processors — particularly if they are not customized for a specific use case or data type — often requiring extensive simulation and data analytics. “The variety of workloads makes the optimization task that much larger,” said Larry Lapides, vice president at Imperas Software. “And while the performance of an individual compute element in an AI accelerator is important, even more important is how the many compute elements (often heterogeneous) in an AI accelerator work together.”

Lapides noted that some RISC-V developers currently expect AI accelerators to account for 30% to 40% of their revenue over the next three years.

Where are AI accelerators used?
How and where AI accelerators are used must be considered up front. “In consumer devices and smartphones, it’s generally as a co-processor,” said Paul Karazuba, vice president of marketing at Expedera. “In automotive applications, it’s generally part of the SoC.”

That has a big impact on the accelerator design. Karazuba noted that with a co-processor, you’re generally focusing on some target applications. “You are more specific in your implementation. And when it comes to a co-processor, you have the ability to optimize the IP or the processor specifically for the use case(s) that your customers have. If I’m a smartphone maker, or I’m a consumer device maker, or even a consumer device chip maker, I have some specific target applications in mind, and that’s what’s causing me to build a co-processor. That means from an IP point of view, I can focus the way I do my processing optimally toward the type of networks. Are they doing transformers? Are they doing floating point inference or fixed point? What type of networks are they running? There are lots of tricks I can do in my IP to optimize processing for those, and that can get you significant improvements in performance, power, and area. A lot of that comes from utilization. You can drive your utilization way up by focusing on the models.”
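As a rough illustration of the kind of data-type specialization Karazuba describes — this is a hypothetical sketch, not Expedera’s IP — quantizing floating-point weights to int8 fixed point is one of the decisions that lets a co-processor use smaller multipliers, move less data, and drive utilization up:

```python
# Minimal sketch of symmetric int8 weight quantization -- the kind of
# data-type decision a co-processor can be specialized for.
# Hypothetical example; not any vendor's implementation.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale factor."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"memory: {w.nbytes} B -> {q.nbytes} B, mean abs error: {err:.5f}")
```

The 4X reduction in weight storage and bandwidth is the easy part; the larger gains come when the fixed-point datapath itself is sized to match the networks the customer actually runs.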

On the flip side, chips used in data centers or automotive applications may need to last for 10 years or more. “With that in mind, what are you going to deploy? You have to build a general-purpose engine,” Karazuba said. “You have to build an engine that is capable of doing everything you need to do today, as well as the unknown things that are coming down the pipe. What that means is you’re going to build a big engine with an incredible amount of computational power, with an incredibly broad computational arsenal. You’re going to do a lot of things moderately well. Then, in a co-processor, you’re going to do fewer things a whole lot better.”

Implementation issues
While the bank of knowledge for designing AI accelerators is expanding, there are still troublesome pockets, particularly around the data.

Neil Hand, director of marketing, IC segment at Siemens EDA, recalled a research paper that described a scenario in which an AI was tricked by changing a few pixels in an image, thereby disrupting the AI inference engine. “The researchers added noise to the image so the AI would now say, ‘This is not a horse, it is a duck.’ What’s interesting is that they were basically reverse-engineering the AI networks to understand how to insert noise to disrupt the networks.”
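A minimal sketch of the idea, assuming a differentiable PyTorch classifier, is shown below. A fast-gradient-sign-style (FGSM) perturbation nudges each pixel in the direction that increases the loss, which can flip the prediction while the change stays nearly imperceptible. This is illustrative only; the paper Hand describes may have used a different attack.

```python
# Sketch of an FGSM-style perturbation that can flip a classifier's
# prediction with a tiny, targeted change to the input image.
# `model` can be any differentiable PyTorch classifier; toy demo below.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Move each pixel a small step in the direction that increases the loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Toy demo with a tiny linear "classifier" on a fake 8x8 image.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 2))
img = torch.rand(1, 1, 8, 8)
lbl = torch.tensor([0])
adv = fgsm_perturb(model, img, lbl, epsilon=0.05)
print("max pixel change:", (adv - img).abs().max().item())
```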

And this is where things get interesting. AI accelerators need to be co-developed with algorithms, which may be biased at the point of accelerator development, or they may become biased over time.

“As we go forward, there are easy things to judge, such as whether we are addressing bias, because that’s just going to determine the data set. Richer data determines the data set,” Hand said. “We’ve got to deal with the fact that people expect anti-bias to be perfectly balanced, but there are biases in nature. Are we trying to eliminate bias or are we trying to mirror the biases that exist in the data set?”

To address this issue, SRC is actively working on debuggable AI. “There are a number of projects funded by SRC on the AI hardware side that are looking at debuggability, because there are two key challenges at the deployment level with AI systems at scale,” Hand said. “First, they’re non-deterministic. You can give them a different data set and they’ll still give you the correct answer. But their weights will all be different. They’ll be all over the place. Or you could give them the same data set but change the order that data is presented, and the results will change. So there’s non-determinism in the system itself. That’s problematic. How do we address that? Richer data, bigger data sets, etc. Then there’s the debuggability of it. ‘It didn’t work. How do I repeat that problem?’ In traditional things like random simulation, we have a seed. We put the same seed in, we get the same result. But what do we do for hardware? There has to be the ability within the AI to understand the network, because for a lot of networks people don’t understand in detail what the different layers are doing. There’s some knowledge of, ‘This is the refinement layer,’ and now it’s, ‘When you’re looking at a face, this is the node that is determining whether it is this.’ First, understand the separations within the AI system, then address the repeatability of it. Something went wrong. It detected the cat as an intruder. Why was the cat detected as an intruder in this particular instance? And how do I replay that back so I understand it as the designer, then fix my system?”
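The seed-based repeatability Hand mentions has a rough analogue in training. The sketch below — hypothetical, not tied to any SRC project — pins the seeds so a run can be replayed exactly, and then shows that presenting the same data in a different order yields different final weights, which is exactly the non-determinism he describes.

```python
# Sketch: pinning seeds and data order makes a training run replayable.
# Keeping the data identical but shuffling its presentation order changes
# the final weights. Illustrative only.
import torch

def train_once(order_seed: int) -> float:
    torch.manual_seed(0)                       # same data and initial weights
    model = torch.nn.Linear(16, 1)
    data, target = torch.randn(256, 16), torch.randn(256, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    g = torch.Generator().manual_seed(order_seed)
    order = torch.randperm(256, generator=g)   # presentation order only
    for i in order:
        opt.zero_grad()
        loss = ((model(data[i:i+1]) - target[i:i+1]) ** 2).mean()
        loss.backward()
        opt.step()
    return model.weight.detach().norm().item()

print(train_once(1), train_once(1))  # identical seeds -> replayable run
print(train_once(1), train_once(2))  # same data, different order -> different weights
```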

On-chip network considerations
Another key consideration is how data moves within a system. “Arguably, the data transport — and with that, the related network-on-chip (NoC) architecture aspects — are the limiting factor in today’s AI accelerator development,” said Frank Schirrmeister, vice president for solutions and business development at Arteris.

The critical components in AI acceleration are computation, data transport, and data storage. The following graph shows that while transistor counts continue to climb, single-threaded CPU performance has flattened for over a decade. GPU computing performance, meanwhile, has doubled yearly, compounding into a roughly 1,000X improvement over a decade compared with single-threaded CPUs.

Fig. 1: Single-threaded CPU vs. GPU performance, plotted on https://bit.ly/3P5y9OK. Source: Arteris/NVIDIA and “Fast validation of DRAM protocols with Timed Petri Nets,” M. Jung et al., MEMSYS 2019
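As a quick sanity check on that figure, yearly doubling against a flat single-threaded baseline compounds to about 2^10 over ten years:

```python
# Rough check of the 1,000X figure: GPU throughput doubling yearly versus a
# flat single-threaded CPU baseline compounds to ~2^10 over a decade.
cpu, gpu = 1.0, 1.0
for year in range(10):
    gpu *= 2.0          # "doubled yearly" per the text; CPU stays ~flat
print(f"gap after 10 years: {gpu / cpu:.0f}X")   # ~1024X
```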

While processing speed has increased, memory bandwidth has not progressed at the same pace, leaving AI chip architects fighting the so-called memory wall.

“Architecturally, external DRAM accesses limit performance and power consumption,” Schirrmeister said. “As a result, architects must balance local data reuse vs. sharing within the chip and into the system, and use closely coupled memories and internal SRAMs as well as standard buffer RAMs managed at the software level. Some aspects of the designs need to consider hardware cache coherence. For AI accelerators, as for GPUs, the paths between custom hardware elements must optimize bandwidth using wide data paths when needed, narrower when not, and optimize for latency by focusing on the highest-performance data paths. As a result, the data-transport architectures — and the NoCs — can make or break AI acceleration.”
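A back-of-the-envelope roofline check illustrates the memory wall Schirrmeister describes: a layer only uses the accelerator’s full compute when its arithmetic intensity exceeds the ratio of peak compute to DRAM bandwidth. The hardware numbers below are hypothetical placeholders, not a specific part.

```python
# Roofline-style sketch of the memory wall: a layer is DRAM-bound unless its
# arithmetic intensity (FLOPs per byte moved) exceeds peak_flops / dram_bw.
# The hardware figures are hypothetical placeholders, not a specific chip.
PEAK_FLOPS = 100e12      # 100 TFLOP/s of matrix math (assumed)
DRAM_BW    = 1e12        # 1 TB/s of external DRAM bandwidth (assumed)

def attainable_flops(arithmetic_intensity: float) -> float:
    """FLOP/s achievable at a given intensity, capped by compute or DRAM."""
    return min(PEAK_FLOPS, DRAM_BW * arithmetic_intensity)

# Example: a matrix multiply C = A @ B with M = N = K = 1024 in fp16.
M = N = K = 1024
flops = 2 * M * N * K                       # multiply-accumulate count
bytes_moved = 2 * (M * K + K * N + M * N)   # fp16 operands + result, ideal reuse
ai = flops / bytes_moved
print(f"arithmetic intensity: {ai:.1f} FLOPs/byte")
print(f"attainable: {attainable_flops(ai) / 1e12:.1f} TFLOP/s "
      f"(ridge point: {PEAK_FLOPS / DRAM_BW:.0f} FLOPs/byte)")
```

Layers with poor reuse — small matrices, elementwise ops, attention with long sequences — fall below the ridge point, which is why on-chip SRAM, data reuse, and the NoC topology matter as much as raw compute.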

Accelerator behaviors and security
While much attention is paid to the volume and quality of the data, there is a growing focus on the robustness of the AI accelerator, as well. This is an area that could use some help, particularly as commercial accelerators are rolled out that may be used for multiple data types.

“People who build AI applications need to be concerned about making their AI robust, making it work well across a broad range of data, and making it deal properly with data that’s outside of the training data, outside of the range of expected data,” said Mike Borza, Synopsys scientist. “You get aberrant behaviors because somebody has been able to provoke the system with data that was outside of the training set. To address this, adversarial training has become a big part of the training process. This means taking data, corrupting it a little bit, and trying to break into the system — or at least train the system not to react to data that has been corrupted. So train it wrong, then train it to do something predictable and safe when it’s wrong.”
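A minimal sketch of the adversarial-training step Borza describes, reusing the FGSM-style perturbation from the earlier example: corrupt the batch slightly and train on both the clean and corrupted copies, so perturbed inputs no longer provoke aberrant behavior. Illustrative only, not Synopsys’ methodology.

```python
# Sketch of one adversarial-training step: build a slightly corrupted copy of
# the batch and train on clean plus corrupted data together. Illustrative only.
import torch
import torch.nn.functional as F

def adversarial_step(model, opt, x, y, epsilon=0.03):
    # Build the perturbed copy (FGSM-style, as in the earlier sketch).
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

    # Train on the clean and corrupted batches together.
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
x, y = torch.rand(32, 1, 8, 8), torch.randint(0, 2, (32,))
print(adversarial_step(model, opt, x, y))
```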

Training models usually involves a massive amount of data, which needs to be well protected as it’s being collected. “The integrity of that data is paramount because it’s expensive to get it,” Borza said. “People generally want to keep it confidential — at least they should — unless they don’t mind driving around for a million kilometers, then giving all that data away to everybody so that everybody’s self-driving car can get better at the same rate. That’s okay, but it’s not necessarily what you want to do if you’re a commercial company looking for a competitive advantage. Data confidentiality approaches answer that, but data integrity is paramount because if that data doesn’t have integrity, then you may be training the AI to behave in a way that is not acceptable. Once you’ve stored that data, it needs to be protected with its own integrity because data can decay over time. It’s well known that there’s corruption in databases. The larger the database, the higher the probability that there is corrupt data in there that needs to be detected and either weeded out or corrected. All of this is in the back end, and protecting the communications is part of that, because that data needs to move around a lot in order to get it to the place where it’s going to be used to train models. Subsequently, the models need to have their integrity protected as you distribute them from where they were trained to where they’re being used.”
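A minimal sketch of the kind of integrity check Borza is describing, using a keyed hash (HMAC-SHA256) over a training file or a distributed model so corruption or tampering is detectable at the point of use. The file names and key handling here are hypothetical, not a specific product’s scheme.

```python
# Sketch: detect corruption or tampering in training data or distributed
# model files with a keyed hash (HMAC-SHA256). Hypothetical example; real
# deployments need proper key management and signed manifests.
import hashlib
import hmac

SECRET_KEY = b"shared-secret-key"   # placeholder key for illustration

def file_tag(path: str) -> str:
    """Compute an HMAC-SHA256 tag over the file contents, in 1 MB chunks."""
    mac = hmac.new(SECRET_KEY, digestmod=hashlib.sha256)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            mac.update(chunk)
    return mac.hexdigest()

def verify(path: str, expected_tag: str) -> bool:
    # Constant-time comparison avoids leaking tag information.
    return hmac.compare_digest(file_tag(path), expected_tag)

with open("weights.bin", "wb") as f:       # stand-in for a model file
    f.write(b"model weights go here")
tag = file_tag("weights.bin")
print(verify("weights.bin", tag))          # True until the file is altered
```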

Further, the integrity of the model is paramount to the correct behavior of the AI, because the inference engines interpret real-world data to produce whatever response to that stimulus they’re going to produce. “If the model doesn’t have integrity, then it’s going to produce a different response than what was intended for that model. And that response can be drastically different because of the propensity of AI systems to go wildly off the rails if they’re given corrupt data that doesn’t correspond to something they’ve been trained to deal with.”

AI on the edge versus the data center
AI on the edge versus in the data center is becoming a much bigger deal. AI is moving out of large data centers into edge devices, and the goal now is to shrink the footprint of what’s needed for a particular application and the power needed to achieve results within an acceptable time frame.

“AI on the edge is a hot topic, and it depends on what you’re doing,” Hand said. “If you look at security systems or inferencing on the edge, such as military systems, there’s a lot of actual learning that they’re doing on the edge. It’s basically shrinking down some of the data center capabilities and putting them up on the edge. The edge is largely dominated by either inferencing or data collection. You categorize your data, and then ship metadata back. You may have parts of a model on the edge that are not actually learning per se, but they’re categorizing the data so they can help the overall system improve. You could think through an example, such as a car understanding its environment, with corrective action being taken. It knows enough to collect that data, understand the tagging of the data, and feed it back. In the data center, you’ve got a different set of challenges because you’re working with huge amounts of data, so it’s the same challenges but scaled up larger. This is why NVIDIA has been on a tear with their data infrastructure of late, and why Google and Tesla are building their own AI infrastructure.”
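A sketch of the edge pattern Hand describes: run a small classifier locally, act on the result, and ship only compact metadata back, flagging low-confidence frames for upload so the central model can improve. The model, labels, and thresholds here are hypothetical placeholders.

```python
# Sketch of edge-side inferencing: classify locally, keep the raw data on the
# device, and send back only compact metadata plus low-confidence samples.
# Model, labels, and thresholds are hypothetical placeholders.
import json
import torch

edge_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 3))
LABELS = ["pedestrian", "vehicle", "unknown"]

def process_frame(frame: torch.Tensor, confidence_floor: float = 0.8) -> str:
    probs = torch.softmax(edge_model(frame), dim=-1)[0]
    conf, idx = probs.max(dim=0)
    metadata = {"label": LABELS[idx.item()], "confidence": round(conf.item(), 3)}
    # Only uncertain frames are flagged for upload to improve the central model.
    metadata["upload_raw"] = conf.item() < confidence_floor
    return json.dumps(metadata)

frame = torch.rand(1, 1, 8, 8)       # stand-in for a camera frame
print(process_frame(frame))
```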

A lot of it comes down to scale and data patterns. “It’s not that different than why the hyperscalers got into SoC design for video and things like that,” he said. “While they had their own data patterns they wanted to accelerate, they had enough scale that they didn’t need to rely on someone else’s silicon. And the benefit was large enough that they didn’t want to rely on FPGAs, which couldn’t give them the throughput that they wanted.”

The same thing is playing out with AI accelerators. In some cases, these companies are creating architectures that are more generally applicable.

“Case in point: At the RISC-V Summit, Meta talked about some of the things they’ve been developing with AI accelerators based on infrastructure they’ve built for their own internal use,” Hand said. “It’s hidden from the users, but they’re building it themselves. What we’ll start to see is the scale, and the opportunity is such that people are going to build it. If you think of AI, it’s only as good as the data. The more data you can give AI, the smarter it will get, and the faster that will happen, which means it will want more data. When we look at the data curve, we’re on an exponential growth of data through 2030, and we’re just at the knee of that growth. There is so much data. It is going to change architectures, and not just at the chip level. It will change them at the system level and the data center level. It’s not that different than a decade ago, when people were debating east-west versus north-south data architectures in the data center. Now we’re looking at, ‘If we’re feeding it this monumental amount of data, what do we do?’ It opens up a whole slew of data privacy rules and laws about how you maintain data integrity in the system. How do you know if knowledge is bleeding out?”

Standards
New custom silicon designs are being announced almost weekly, which is proof of just how much appetite there is for these designs and how high the expectations are running for investors and systems companies alike.

“What’s interesting is that, when you go under the hood, there are differences between the choices these different hyperscalers are making at the low levels of hardware design,” said Alphawave’s Chan Carusone. “That’s an indication that it’s a new area. These hyperscalers may start rolling out these kinds of AI chips with wider use and higher volume. There’s going to be more desire to make use of standards to create the networks of these things. So far, everyone’s got a lot of proprietary solutions for networking them together. NVIDIA has their solution, Google has their solution, and so on. We’re also starting to see the appetite emerge to have standardized networks because as their need for volume grows, they’re going to want to be interoperable, and have a broader, more stable supply chain, and be able to grab stuff off-the-shelf at some point over having their own niche solution or closed ecosystem.”

As with all new technologies, standards will begin to take root in order to widen the market and reduce development costs. “Take Si2, for instance, with the work they’re doing around managing the AI data exchange,” said Siemens’ Hand. “There will be similar things that happen within the industry defining standards for debuggability. It’s a bit of the Wild West at the moment as everyone tries to identify how to use AI. Where does it give value? As it matures, it will be a bit like everything in the traditional compute space. It will settle down, and there will start to be best practices that are shared. There will be data security practices, and so on. We’re still very much in the Cambrian-explosion-of-ideas phase, which makes it extremely exciting. We’ll look back in a few years and say, ‘Oh, okay, now, it all makes sense.’ EDA, and semiconductor as a whole, are only revolutionary in hindsight. When you’re looking forward, it’s always evolutionary. We’re engineers, we just chip off the next thing bit by bit. It’s only when we look back that we see it was a revolution back then — a sea change. We just didn’t see it because we were just stepping through it bit by bit.”

Related Reading
Partitioning Processors For AI Workloads
General-purpose processing, and lack of flexibility, are far from ideal for AI/ML workloads.
Processor Tradeoffs For AI Workloads
Gaps are widening between technology advances and demands, and closing them is becoming more difficult.



5 comments

allan cox says:

While partitioning and network connectivity are indeed important architectural choices, you make no mention of possibly the largest architectural contributor to edge PPA efficiency: analog NNs. Are you discounting them?

John Derrick says:

First, some context. I’ve started and exited a couple of companies in application-specific compute after driving certain aspects of processor architecture at IBM. Second, I’ve been involved in AI for a while, and a few companies that I advised and helped formulate their approach have since exited.
With all of that and the advancements in compute / software models for AI, I strongly believe the biggest challenge in AI is not just the results, but some type of linkage back to why. Working on that now.

Ann Mutschler says:

Hi Allan, thank you for your comment. I plan to look into this next. There’s only so much I could fit into this article. Thanks for reading.

Ann Mutschler says:

That is fascinating, John. I’d be interested to talk with you more about this. If that is of interest to you, please reach out. [email protected]

Manil Vasantha says:

Hallucination, model contamination, and overfitting are all real problems. Large LLMs may no longer be the solution. There may be smart SLMs with intelligent accelerators that act as routers and route the traffic based on the topic.
