When To Expect Domain-Specific AI Chips

With the intended application evolving faster than silicon can be developed, optimizing hardware becomes a delicate balance.

The chip industry is moving toward domain-specific computation, while artificial intelligence (AI) is moving in the opposite direction, creating a gap that could force significant changes in how chips and systems are architected in the future.

Behind this split is the difference in how long it takes to develop hardware versus software. In the 18 months since ChatGPT was unleashed on the world, there has been a flood of software startups exploring new architectures and technologies. That trend is likely to continue, given the rate at which new tasks are being mapped onto them. But it often takes longer than 18 months to produce a single customized chip.

In a world of standards, where software does not change much over time, it pays to customize hardware to meet the exact needs of an application or workload, and little else. This is one of the major drivers behind RISC-V, where the processor ISA can be tailored to a given task. But with many flavors of AI, change is so rapid that hardware may already be outdated by the time it reaches volume manufacturing. So hardware that is specifically optimized for an application is unlikely to reach the market quickly enough to be useful unless the specification is constantly updated.
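
To make the contrast concrete, here is a toy sketch of what tailoring an ISA to a task means. Everything in it is invented for illustration: the opcodes, the register file, and the "MACC" custom instruction are not real RISC-V, and a real custom extension would live in the reserved vendor opcode space and be implemented in RTL rather than Python.

```python
# Toy register-machine sketch: what adding a domain-specific instruction looks like.
# The ISA, opcodes, and "MACC" extension are invented for illustration only.

def run(program, regs=None):
    regs = regs or [0] * 8
    for op, *args in program:
        if op == "LI":          # load immediate: LI rd, imm
            rd, imm = args
            regs[rd] = imm
        elif op == "MUL":       # rd = rs1 * rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] * regs[rs2]
        elif op == "ADD":       # rd = rs1 + rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "MACC":      # hypothetical custom op: rd += rs1 * rs2 in one step
            rd, rs1, rs2 = args
            regs[rd] += regs[rs1] * regs[rs2]
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs

# A dot-product inner step that takes two instructions on the base ISA collapses
# to one with the custom MACC, which is the kind of win a task-specific ISA chases.
base   = [("LI", 0, 0), ("LI", 1, 3), ("LI", 2, 4), ("MUL", 3, 1, 2), ("ADD", 0, 0, 3)]
custom = [("LI", 0, 0), ("LI", 1, 3), ("LI", 2, 4), ("MACC", 0, 1, 2)]
assert run(base)[0] == run(custom)[0] == 12
```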

As a result, the risk that a domain-specific AI chip will not work correctly on the first pass increases. And while it is being fixed, generative AI will have moved on.

This by no means spells the end for custom silicon. Data centers are deploying a widening range of processing architectures, where each of them is better at a given task than a single general-purpose CPU. “With the explosion of AI workloads in the data center, even the last fortress of vanilla compute horsepower has crumbled as data center chips and systems are forced to adapt to a rapidly evolving landscape,” says Steve Roddy, chief marketing officer at Quadric.

But it does point toward architectures that balance ultra-fast, low-power custom silicon with more general-purpose chips or chiplets.

“In the AI space, there is a very strong drive toward making things as general and programmable as reasonable, simply because nobody knows when the next LLM thing will come out and totally change the way they want to do things,” says Elad Alon, CEO of Blue Cheetah. “The more you bake in, the more you’re possibly going to miss the wave. At the same time, it’s abundantly clear that it’s nearly impossible to satisfy the compute, and therefore power and energy, associated with using fully general-purpose systems. There’s an incredibly strong drive to customize the hardware to be much more efficient at the particular things that are known today.”

The challenge is to efficiently map software onto this heterogeneous array of processors, something the industry has not yet fully mastered. The more processor architectures that co-exist, the more difficult the mapping problem can become. “You have a GPU, a neural processing unit in modern chips, and you have core processing,” said Frank Schirrmeister, vice president of solutions and business development at Arteris at the time of this interview (and currently executive director for strategic programs and systems solutions at Synopsys). “You have at least three compute options, and you have to decide where to put things with the appropriate abstraction layer. We used to call that software-software co-design. When you have ported the algorithm, or a part of the algorithm, to be done in the NPU or GPU, you reshuffle the software and move more of your software execution into something that is more efficient for the implementation. There remains a generic component of the compute supporting different elements.”
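
A minimal sketch of that mapping problem appears below. The backends, operator names, and cost numbers are invented for illustration; a production mapper would rely on measured kernel times, memory traffic, and data-movement costs rather than a hand-written table.

```python
# Minimal sketch of partitioning a model graph across heterogeneous compute.
# Backends, operators, and cost figures are hypothetical.

COST = {  # relative cost (lower is better) of each op on each engine
    "conv":    {"cpu": 40, "gpu": 4,  "npu": 2},
    "matmul":  {"cpu": 50, "gpu": 3,  "npu": 3},
    "softmax": {"cpu": 5,  "gpu": 4,  "npu": 6},
    "resize":  {"cpu": 8,  "gpu": 2,  "npu": 9},
}

def map_graph(ops, available=("cpu", "gpu", "npu")):
    """Greedy per-op placement: pick the cheapest available engine for each op."""
    return {op: min(available, key=lambda eng: COST[op][eng]) for op in ops}

# If the NPU is absent (e.g., a cheaper SKU), the same software has to fall back
# to the remaining engines, which is the reshuffling Schirrmeister describes.
print(map_graph(["conv", "matmul", "softmax", "resize"]))
print(map_graph(["conv", "matmul", "softmax", "resize"], available=("cpu", "gpu")))
```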

Chasing the leader
The advent of AI was enabled by the processing power of the GPU, with the functions required for graphics processing fairly closely matching those at the heart of AI computation. In addition, the creation of a software tool chain that enabled non-graphics functions to be mapped onto the architecture made NVIDIA GPUs the easiest processors to target.

“When someone becomes the market leader, and they may be the only game in town, everybody tries to react to it,” says Chris Mueth, new opportunities business manager at Keysight. “But that does not mean it is the most optimal architecture. We may not know that for a while. GPUs are suited for certain applications, such as doing repetitive math operations, and for that it is hard to beat. If you optimize your software to work with the GPU, it can be blazingly fast.”

Being the general-purpose leader can create headwinds. “If you’re going to build a general-purpose accelerator, you need to worry about future-proofing,” says Russell Klein, program director for high-level synthesis at Siemens EDA. “When NVIDIA sits down to build a TPU, they have to make sure that the TPU can address the broadest possible market, which means anybody who dreams up a new neural network needs to be able to drop it into this accelerator and have it run. If you are going to build something specific to one application, you don’t need to future-proof it nearly as much. I may want to build in a little flexibility, so I have the capability of fixing problems. But if that just gets nailed down to a very specific implementation that performs one job really well, then in another 18 months somebody’s going to come up with a brand new algorithm. The good news is that I’m going to be ahead of everybody else, using my customized implementation until they can catch up with their own customized implementation. There’s only so much we’re going to be able to do with off-the-shelf hardware.”

But specificity also can be built in layers. “Part of the delivery of IP is the hardware abstraction layer that it exposes to the software in a standardized way,” says Schirrmeister. “A graphics core is of no use without some middleware. The application specificity moves upwards in abstraction. If you look at CUDA, the NVIDIA cores by themselves are fairly generic in their compute capabilities. CUDA is the abstraction layer, which then has libraries for biology, for all kinds of things on top of it. And that’s brilliant because the application specificity moves up to a much higher level.”
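
The layering Schirrmeister describes can be sketched in a few lines. The function names and toy data below are assumptions for illustration only: the generic primitive stands in for a CUDA/cuBLAS kernel, and the thin wrapper stands in for a domain library built on top of it.

```python
import numpy as np

# Sketch of application specificity moving up the stack: the compute primitive
# is generic, and only the thin library wrapper gives it a domain meaning.

def gemm(a, b):
    """Generic primitive: dense matrix multiply, knows nothing about the domain."""
    return a @ b

def score_sequences(profile_matrix, one_hot_sequences):
    """'Domain library' layer: scoring biological sequences against a profile
    is just a matmul once the data is encoded; the specificity lives here."""
    return gemm(one_hot_sequences, profile_matrix)

# Hypothetical toy data: 3 sequences of length 4 over a 4-letter alphabet.
seqs = np.eye(4)[np.array([[0, 1, 2, 3], [3, 3, 1, 0], [2, 2, 2, 2]])]  # (3, 4, 4)
profile = np.random.rand(4, 5)                                          # (4, 5)
print(score_sequences(profile, seqs).shape)                             # (3, 4, 5)
```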

Those abstraction layers have been important in the past. “Arm consolidated the software ecosystem on top of the application processor,” says Sharad Chole, chief scientist and co-founder of Expedera. “After that, heterogeneous computing enabled everyone to build their own additions to that software stack. Qualcomm’s stack is completely independent of Apple’s stack, for example. If you stretch that ahead, there is an interface that can be utilized to get even better performance or a better power profile. Then there is room for co-processors. Those co-processors will allow you to do even more differentiation than just building with heterogeneous computing, because then you can add it or remove it, or you can actually build a newer co-processor without spinning a new application processor, which is much more costly.”

Economics is an important factor. “The proliferation of fully programmable devices that accept C++ or other high-level language, and function-specific GPUs, GPNPUs, and DSPs reduces the need for dedicated, fixed-function and financially risky hardware acceleration blocks in new designs,” says Quadric’s Roddy.

This is a business issue as much as a technology issue. “One may say I am going to do this very specific targeted application and, in that case, I know that I’m going to do these following sets of things in the AI, or other stack, and then you just make them work,” says Blue Cheetah’s Alon. “If that’s a large enough market, it may be an interesting choice for a company to make. But for an AI accelerator or AI chip startup, that’s a trickier bet. If there isn’t enough market to justify the whole investment, then you have to project the capabilities required for markets that do not yet exist. It’s really a mix of what type of business model and bets are you placing, and therefore what technological strategy one can take to optimize that as best as possible.”

The case for dedicated hardware
Hardware implementation requires choices. “If we could standardize the neural networks and say this is all we are going to do, you still have to consider the number of parameters, the number of operations that are necessary, the latency needed,” says Expedera’s Chole. “But that’s never the case, especially for AI. From the beginning, we started with postage-stamp images like 224 x 224, then moved to HD, and now we are going to 4K. Same thing with LLMs. We started with a model of a few hundred million parameters, like BERT, and now we are going toward billions and billions — or a trillion — parameters. Initially we started with only language translation models like token prediction models. Now we have multimodal models where we can support language, plus vision, plus audio simultaneously. The workload continues to evolve, and that is the chase game that is happening.”
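
The scale of that chase can be shown with some rough arithmetic. The figures below are approximate assumptions (3-channel images, fp16 weights, BERT-large at roughly 340 million parameters), used only to illustrate the growth Chole describes.

```python
# Back-of-envelope arithmetic for the workload growth described above.
# All figures are rough assumptions for illustration.

def image_elements(h, w, channels=3):
    return h * w * channels

def weight_bytes(params, bytes_per_param=2):   # fp16
    return params * bytes_per_param

print("224x224 vs 4K input elements: %.0fx" %
      (image_elements(2160, 3840) / image_elements(224, 224)))   # ~165x more pixels

bert_large = 340e6          # ~340M parameters
trillion   = 1e12
print("BERT-large weights: %5.1f GB (fp16)" % (weight_bytes(bert_large) / 1e9))
print("1T-param weights:   %5.1f GB (fp16)" % (weight_bytes(trillion) / 1e9))
# Roughly 0.7 GB versus 2,000 GB of weights alone; hardware sized for one model
# generation is badly sized for the next.
```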

There are many aspects of existing architectures that can be questioned. “A key part in designing a good system is finding the salient bottlenecks in system performance and finding ways to accelerate them,” says Dave Fick, CEO and cofounder of Mythic. “AI is an exciting and impactful technology. However, it requires performance levels measured in trillions of operations per second and memory bandwidth that is completely unsupportable by standard cache and DRAM architectures. This combination of being both useful and challenging makes AI a prime candidate for specialized hardware units.”
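
A back-of-envelope calculation shows why that bandwidth is unsupportable by standard memory systems. The model size, token rate, and per-channel bandwidth figures below are round assumptions for illustration, not measurements of any particular product.

```python
# Rough arithmetic behind the memory-bandwidth claim: a memory-bound LLM decode
# step must stream most of its weights for every generated token.
# All numbers below are round assumptions for illustration.

params          = 70e9        # 70B-parameter model
bytes_per_param = 2           # fp16
tokens_per_sec  = 20          # a modest interactive rate for one user

required_bw     = params * bytes_per_param * tokens_per_sec   # bytes/second
ddr5_channel_bw = 50e9        # ~50 GB/s, order of magnitude for one DDR5 channel
hbm_stack_bw    = 800e9       # ~800 GB/s, order of magnitude for one HBM3 stack

print("Required bandwidth:   %.1f TB/s" % (required_bw / 1e12))   # ~2.8 TB/s
print("DDR5 channels needed: %.0f" % (required_bw / ddr5_channel_bw))
print("HBM3 stacks needed:   %.1f" % (required_bw / hbm_stack_bw))
```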

There are not enough general-purpose devices to meet demand, which may be the factor that forces the industry to start adopting more efficient hardware solutions. “The progress that is happening in the generative AI field is extremely fast,” says Chole. “There is nothing currently available that can keep up with the requirements for the hardware in terms of cost and power. There’s nothing. Even GPUs do not have enough shipments. There are orders, but not enough shipments. This is the problem that everyone is seeing. There is not enough computing power to actually support the generative AI workloads.”

Chiplets may help to alleviate this problem. “The coming tsunami of chiplets will serve to speed up that transition in the data center,” says Roddy. “As chiplet packaging replaces monolithic ICs, the ability to mix and match fully-programmable CPUs, GPUs, GPNPUs (general-purpose, programmable NPUs) and other processing engines for a given task will impact the data centers first, and then slowly radiate out into higher-volume, more cost-sensitive markets as the packaging costs of chiplets inexorably gets reduced when volumes ramp.”

Multiple markets, multiple tradeoffs
While most attention is focused on the large data centers where new models are trained, the ultimate gains will come from the devices that use those models for inferencing. Those devices cannot afford the huge power budgets used for training. “The hardware for training AI is somewhat standard,” says Marc Swinnen, director of product marketing at Ansys. “You buy NVIDIA chips and that’s how you train the AI. But once you have your model built, how do you execute the model in the end application, perhaps at the edge? That is often a bespoke, custom chip for that particular implementation of that AI algorithm. The only way you’re going to get a high-speed, low-power implementation of your AI model is by building a custom chip for it. AI will be a huge driver of custom hardware for each one of these models in execution.”

Designers of those devices have a similar array of decisions to make. “Not every AI accelerator will be the same,” says Mythic’s Fick. “There are many great ideas for how to address the memory and performance challenges posed by AI. In particular, there are new data types going all the way down to 4-bit floating point or even 1-bit precision. It is possible to use analog computing to get extreme memory bandwidth for greater performance and energy efficiency. Others are looking at pruning neural networks down to the most critical bits to save memory and computation. All these techniques will produce hardware that is strong in some areas and weak in others. This means greater hardware-software co-optimization and the need to seed an ecosystem with a variety of AI processing options.”
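
Two of the techniques Fick mentions, low-precision quantization and pruning, can be sketched with a few lines of numpy. The scales, thresholds, and keep ratio below are illustrative assumptions; real flows calibrate per channel and retrain to recover accuracy.

```python
import numpy as np

# Sketches of symmetric 4-bit quantization and magnitude pruning.
# Parameters are illustrative, not tuned for any real model.

def quantize_int4(w):
    """Map float weights to 4-bit signed integers [-8, 7] with one scale per tensor."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def prune_magnitude(w, keep_ratio=0.3):
    """Zero out all but the largest-magnitude weights (here, keep 30%)."""
    k = int(w.size * keep_ratio)
    threshold = np.sort(np.abs(w), axis=None)[-k]
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int4(w)
w_pruned = prune_magnitude(w)

print("mean quantization error:", np.abs(w - q * scale).mean())
print("sparsity after pruning: %.0f%%" % (100 * (w_pruned == 0).mean()))
# Either technique shifts the hardware trade-off: 4-bit weights carry a quarter of
# the memory traffic of fp16, and a 70% sparse tensor skips most multiplies, but
# only on hardware built to exploit that.
```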

And this is where the interests of AI and RISC-V converge. “When it comes to software tasks like LLMs, they will become dominant enough to force new hardware architectures, but they won’t stop differentiation altogether, not in the short term,” says Dieter Therssen, CEO of Sigasi. “Even the customization of RISC-V is based on the need to do some CNN or LLM processing. A key factor here will be how AI is deployed. Currently, there are so many ways to do so that imagining convergence is still too far out.”

Conclusion
AI is new and evolving so rapidly that nobody has definitive answers. What are the optimal architectures for existing applications, and will future applications look similar enough that existing architectures only need to be scaled? That would seem to be a very naïve projection, but today it is probably the best bet for many companies.

The rapid rise of AI has been made possible by the GPU and the software abstractions built on top of it. That combination has provided an adequate framework for the expansion seen so far, but that does not mean it is the most efficient platform. Model development has, in part, been forced in the direction supported by existing hardware, but as additional architectures become available, AI and model development may diverge based on the hardware resources available and the power demands they impose. It is likely that power will become the factor that steers both, because current projections suggest AI will soon consume a significant fraction of the world’s power generation capacity. That cannot continue unabated.

Related Reading
Will Domain-Specific ICs Become Ubiquitous?
How shifts in end markets and device scaling could alter some fundamental assumptions in chip design.


