Patterns And Issues In AI Chip Design

Devices are getting smarter, but they’re also consuming more energy and becoming harder to architect; constant change remains a challenge.


AI is becoming more than a talking point for chip and system design, taking on increasingly complex tasks that are now competitive requirements in many markets.

But the inclusion of AI, along with its machine learning and deep learning subcategories, also has injected widespread confusion and uncertainty into every aspect of electronics. This is partly because AI touches so many different devices and processes, and partly because AI itself is constantly changing.

AI spans everything from training algorithms to inferencing. It includes massive training programs, and tinyML algorithms that can fit into a tiny IoT device. In addition, it is used increasingly in many aspects of chip design, as well as in the fab to correlate data from manufacturing, inspection, metrology and testing of those chips. It’s even used in the field to identify patterns in failures that can be fed back into future designs and manufacturing processes.

Within this broad collection of applications and technologies, there are several common goals:

  • Reducing the amount of energy required for AI/ML/DL computations;
  • Quicker time to results, which requires more parallelization and throughput, as well as fundamental architectural changes in both hardware and software, and
  • Improved accuracy of those results, which affects both power and performance.

Higher efficiency
With any flavor or application of AI, performance per watt or per operation is a critical metric. Energy needs to be generated and stored to perform AI/ML/DL computations, and there is an associated cost in terms of resources, utilities, and area.

Training of algorithms typically involves massive parallelization of multiply/accumulate operations. Efficiency comes through the elasticity of compute elements in hyperscaler data centers — being able to ramp compute resources as needed, and to shift them to other projects when they’re not — as well as more intelligent use of those resources coupled with increasingly granular sparsity models.
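As a rough illustration of why granular sparsity pays off in operation count (and therefore energy), the toy Python sketch below compares a dense multiply/accumulate pass with one that skips pruned weights. The array size, sparsity level, and one-op-per-weight cost model are illustrative assumptions, not measurements from any real accelerator.

```python
# Toy sketch: operation counts for a dense vs. a pruned (sparse) MAC pass.
# Sizes, sparsity level, and the one-op-per-weight cost model are assumptions.

import random

random.seed(0)
n = 10_000
sparsity = 0.8  # assume 80% of the weights have been pruned to zero
weights = [0.0 if random.random() < sparsity else random.gauss(0.0, 1.0) for _ in range(n)]

dense_macs = len(weights)                          # every weight is fetched and multiplied
sparse_macs = sum(1 for w in weights if w != 0.0)  # only nonzero weights do work

print(f"dense MACs:  {dense_macs}")
print(f"sparse MACs: {sparse_macs} (~{100 * (1 - sparse_macs / dense_macs):.0f}% fewer operations)")
```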

Jeff Dean, chief scientist at Google, pointed to three trends in ML models — sparsity, adaptive computation, and dynamically changing neural networks. “Dense models are ones where the whole model is activated for every input example or every token that is generated,” he explained in a presentation at the recent Hot Chips conference. “While they’re great, and they’ve achieved important things, sparse computation will be the trend in the future. Sparse models have different pathways that are adaptively called upon as needed.”

What’s changing is a recognition that those sparse models can be partitioned across processing elements more intelligently. “There’s no point in spending the same amount of compute on every example, because some examples are 100 times as hard,” said Dean. “So we should be spending 100 times the computation on things that are really difficult as on things that are very simple.”
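A minimal sketch of that adaptive-computation idea might route easy inputs through a cheap path and hard inputs through an expensive one. The difficulty scores, thresholds, and relative costs below are invented placeholders, not Google's actual routing logic.

```python
# Toy sketch of adaptive computation: easy inputs take a cheap path,
# hard inputs take an expensive one. Scores, thresholds, and costs are placeholders.

def route(example_difficulty: float) -> str:
    """Pick a pathway based on a (hypothetical) difficulty estimate in [0, 1]."""
    if example_difficulty < 0.3:
        return "small_path"    # e.g., a few layers or a few experts
    elif example_difficulty < 0.8:
        return "medium_path"
    return "large_path"        # full model, many experts

COST = {"small_path": 1, "medium_path": 10, "large_path": 100}  # relative MACs per input

batch = [0.05, 0.2, 0.5, 0.95, 0.1, 0.7]        # hypothetical difficulty scores
adaptive_cost = sum(COST[route(d)] for d in batch)
dense_cost = len(batch) * COST["large_path"]     # a dense model pays full cost for every input

print(f"adaptive cost: {adaptive_cost} vs dense cost: {dense_cost}")
```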

Fig. 1: Adaptive computation with granular sparsity. Source: Google/Hot Chips 2023

Resources and compute models at the edge are very different, but the same basic principles of abstraction, customization, and right-sizing still apply.

Abstraction is more about looking at tradeoffs, both locally and at the system level. For example, it’s possible to essentially hard-wire some elements of a processor or accelerator while also providing enough flexibility to incorporate future changes. This is particularly useful where one part may be used in multiple applications, and where the life expectancy of a chip is long enough to warrant some level of programmability. It’s a similar approach to some of the analog IP developed for advanced-node SoCs, where most of the architecture is digital.

“It’s important that the memory or the data path feeding in and out of these hard-wired blocks can support the permutations we need, because a lot of times with AI workloads the access patterns may be kind of wonky,” said Cheng Wang, CTO and co-founder of Flex Logix. “It’s also very common for AI where you need to add some offsets or some scaling factor to the data before feeding it into the engine. And the engine, of course, is hard-wired, and the output has to go through some flexible activation functions and gets routed either to SRAM or DRAM, or both, based on the demands of the workload. So all of those flexibilities are required and need to be in place to keep the MAC efficient. For example, if your memory bandwidth is insufficient, then you will have to stop, in which case it doesn’t matter how fast the MAC is. If you’re stalling, you’re going to be running at the speed of the memory, not the computer.”
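A highly simplified sketch of the dataflow Wang describes, with placeholder Python functions standing in for the programmable front end (scale and offset), the hard-wired MAC array, the selectable activation stage, and the SRAM/DRAM routing decision, might look like the following. Every function name, the SRAM budget, and all numbers here are illustrative assumptions.

```python
# Simplified sketch of a hard-wired MAC engine wrapped in programmable stages.
# Function names, the SRAM budget, and all numbers are illustrative only.

SRAM_BUDGET_BYTES = 1024  # pretend on-chip buffer size

def pre_scale(data, scale, offset):
    """Programmable front end: apply a scaling factor and offset before the engine."""
    return [scale * x + offset for x in data]

def hard_wired_mac(data, weights):
    """Fixed-function MAC array: the part that is not programmable."""
    return sum(d * w for d, w in zip(data, weights))

def activation(x, kind="relu"):
    """Flexible activation stage, selected per layer."""
    if kind == "relu":
        return max(0.0, x)
    if kind == "clip":
        return min(max(x, -1.0), 1.0)
    return x  # pass-through

def route(values, bytes_per_value=4):
    """Send results to SRAM if they fit in the on-chip budget, otherwise to DRAM."""
    return "SRAM" if len(values) * bytes_per_value <= SRAM_BUDGET_BYTES else "DRAM"

data = [0.5, -1.2, 3.3, 0.0]
weights = [[0.1, 0.4, -0.2, 0.9], [0.3, 0.0, 0.7, -0.5]]  # two output channels
outputs = [activation(hard_wired_mac(pre_scale(data, scale=2.0, offset=0.1), w)) for w in weights]
print(outputs, "->", route(outputs))
```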

Rightsizing
Memory architectures are changing for similar reasons. “AI is increasingly being used to extract meaningful data and monetize it,” said Steven Woo, fellow and distinguished inventor at Rambus, during a recent presentation. “It really needs very fast memory and fast interfaces, not only for the servers, but also for the acceleration engines. We’re seeing relentless demand for faster-performing memory and interconnects, and we expect that trend to continue far into the future. We’re seeing that the industry is responding. Data centers are evolving to meet the needs and the demands of data-driven applications like AI and other kinds of server processing. We’re seeing changes in the main memory roadmap as we transition from DDR4 to DDR5, and we’re also seeing new technologies like CXL come to market as data centers evolve from more captive resources into pooled resources that can improve computing beyond where we are today.”
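One way to make that dependence on memory concrete is a back-of-the-envelope roofline check: when a workload performs few operations per byte moved, attainable throughput is set by memory bandwidth rather than by the MAC array. The peak-compute, bandwidth, and arithmetic-intensity numbers below are purely hypothetical.

```python
# Back-of-the-envelope roofline check with illustrative numbers.

peak_compute_tops = 100.0        # hypothetical accelerator peak, tera-ops/s
mem_bandwidth_gbs = 3200.0       # hypothetical memory bandwidth, GB/s
arithmetic_intensity = 10.0      # ops per byte moved (low for bandwidth-bound layers)

# Attainable throughput is the lesser of the compute roof and the memory roof.
memory_roof_tops = mem_bandwidth_gbs * arithmetic_intensity / 1000.0  # GB/s * ops/B -> Tops
attainable = min(peak_compute_tops, memory_roof_tops)

print(f"memory-limited roof: {memory_roof_tops:.1f} TOPS")
print(f"attainable:          {attainable:.1f} TOPS of {peak_compute_tops:.0f} TOPS peak")
```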

The same kinds of trends are redefining the edge, as well. “The chipset makers are working with the silicon development team to look at it from the system perspective for performance and power consumption,” said C.S. Lin, marketing executive at Winbond. “So for this kind of product, what kind of bandwidth do you need? And what kind of process is required on the SoC side, and what kind of memory? All of this needs to be paired together to achieve speeds of 32 gigabits per second, for example (for NVMe PCIe Gen 3). And then, in order to do that, you need to integrate a protocol inside a chip, and only the most advanced processes are capable of providing this kind of thing.”

Regardless of whether it’s the cloud or the edge, customization and rightsizing increasingly are required for AI applications. Nearly all training of algorithms is done in large data centers today, where the number of MAC functions can be ramped up or down, and computations can be partitioned across different elements. That may change, as algorithms become more mature, sparser, and increasingly customized. But the majority of the computing world will leverage those AI algorithms for inferencing, at least for now.

“About 75% of all the data that’s generated by 2025 is going to come from the edge and the endpoint of the network,” said Sailesh Chittipeddi, executive vice president at Renesas, during a panel discussion at SEMICON West. “Your ability to predict what happens at the edge and the endpoint is really what makes a tremendous difference. When you think about compute, you think about microcontrollers and microprocessors and CPUs and GPUs. The latest buzz is all about GPUs and what’s happening with GPT3 and GPT4. But those are large language models. For most datasets, you don’t require such tremendous amounts of processing power.”

One of the challenges at the edge is rapidly discarding useless data and only keeping what is needed, and then processing that data more quickly. “When AI is on the edge, it is dealing with sensors,” said Sharad Chole, chief scientist and co-founder of Expedera. “The data is being generated in real time and needs to be processed. So how sensor data comes in, and how quickly an AI NPU can process it, changes a lot of things in terms of how much data needs to be buffered, how much bandwidth needs to be used, and what the overall latency looks like. The objective is always the lowest possible latency. That means latency from the sensor input to the output, which maybe goes into an application processor for even further post-processing, should be as low as possible. And we need to make sure we can provide that data as a guarantee in a deterministic fashion.”
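A simple way to reason about that budget is to work backward from the sensor's data rate and an assumed NPU consumption rate. The frame format, frame rate, and NPU throughput figures below are hypothetical, but the arithmetic shows how buffering and latency fall out of the numbers.

```python
# Hypothetical latency/buffer budget for a camera sensor feeding an edge NPU.
# Frame format, frame rate, and NPU consumption rate are all assumed values.

frame_w, frame_h, bytes_per_pixel = 1920, 1080, 2   # illustrative sensor output
fps = 30
npu_throughput_mbs = 200.0                          # MB/s the NPU is assumed to consume

frame_bytes = frame_w * frame_h * bytes_per_pixel
sensor_rate_mbs = frame_bytes * fps / 1e6
per_frame_latency_ms = frame_bytes / (npu_throughput_mbs * 1e6) * 1000
frame_deadline_ms = 1000 / fps

print(f"sensor rate:           {sensor_rate_mbs:.1f} MB/s")
print(f"single-frame buffer:   {frame_bytes / 1e6:.1f} MB")
print(f"per-frame NPU latency: {per_frame_latency_ms:.1f} ms (deadline: {frame_deadline_ms:.1f} ms)")
print("keeps up:", npu_throughput_mbs >= sensor_rate_mbs)
```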

The price of accuracy
For any AI application, performance is a measure of time to results. AI systems typically partition computation among multiply/accumulate elements to run in parallel, then collect and conflate the results as quickly as possible. The shorter the time to results, the more energy required, which is why there is so much buzz around customization of processing elements and architectures.
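A toy cost model makes that tradeoff visible: fanning the same MAC workload across more processing elements shortens the time to results, but per-element overhead pushes total energy up. The per-MAC energy, per-element overhead, and throughput constants below are invented for illustration.

```python
# Toy model of the latency/energy tradeoff when fanning a MAC workload
# out across more processing elements. All constants are illustrative.

def latency_energy(total_macs, num_pes, mac_energy_pj=1.0, pe_overhead_pj=50_000.0,
                   macs_per_pe_per_us=1000.0):
    """Return (latency_us, energy_uj) for a perfectly partitioned workload."""
    latency_us = total_macs / (num_pes * macs_per_pe_per_us)
    energy_pj = total_macs * mac_energy_pj + num_pes * pe_overhead_pj  # overhead grows with PEs
    return latency_us, energy_pj / 1e6

for pes in (1, 8, 64, 512):
    lat, en = latency_energy(total_macs=10_000_000, num_pes=pes)
    print(f"{pes:4d} PEs -> latency {lat:8.1f} us, energy {en:6.2f} uJ")
```

Real accelerators amortize that overhead far better than this toy model suggests, but the direction of the tradeoff matches the point above: more parallelism buys latency at the cost of total energy.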

In general, more compute elements are required to produce more accurate results in less time. That depends to some extent on data quality, which needs to be both good and relevant, and it requires algorithms to be trained appropriately for the task. A general-purpose processor is less efficient, and the same is true for a general-purpose algorithm. In addition, for many end applications, the amount of AI — including sub-categories like machine learning and deep learning — may be limited by the overall system design.

This is an area ripe for architectural improvements, and some innovative tradeoffs are starting to show up. Arm, for one, has created a new Neoverse V2 platform specifically for cloud, high-performance computing, and AI/ML workloads, according to Magnus Bruce, lead CPU architect and fellow at Arm. In a presentation at the recent Hot Chips conference, he highlighted the decoupling of branch prediction from fetch in order to improve performance in the branch-prediction pipeline, with advanced prefetching that includes accuracy monitoring. Put simply, the goal is to get much more granular about predicting what a chip’s next operations will be, and to shorten recovery time when there is a misprediction.
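As a generic illustration of the prediction-plus-accuracy-monitoring idea (not Arm's actual Neoverse V2 microarchitecture), a classic two-bit saturating-counter branch predictor with a built-in accuracy monitor fits in a few lines.

```python
# Toy two-bit saturating-counter branch predictor with an accuracy monitor.
# Illustrates the general idea only, not Arm's Neoverse V2 design.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2          # 0-1 predict not-taken, 2-3 predict taken
        self.hits = 0
        self.total = 0

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        self.total += 1
        if self.predict() == taken:
            self.hits += 1
        # Saturating counter: move toward taken/not-taken, clamp to [0, 3].
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

    @property
    def accuracy(self) -> float:
        return self.hits / self.total if self.total else 0.0

pred = TwoBitPredictor()
# A loop-like branch: taken nine times, then not taken once, repeated.
for outcome in ([True] * 9 + [False]) * 10:
    pred.update(outcome)
print(f"prediction accuracy: {pred.accuracy:.0%}")
```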

Fig. 2: Architectural and microarchitectural efficiency based on improved accuracy. Source: Arm/Hot Chips 2023

Designing with AI
In addition to architectural changes, AI may be able to help accelerate the design of the hardware itself.

“The underlying metrics that customers care about are still power, performance, area, and schedule,” said Shankar Krishnamoorthy, general manager of the EDA group at Synopsys. “But what’s changed is that the engineering cost to achieve that has shot up dramatically because of load complexity, design complexity, and verification complexity. We’ve had several customers tell us it’s essentially 4X more work. They can barely add another 10% or 20% more engineers, and so who’s going to close that gap? That’s really where AI stepped in, and it has become a big disrupter in terms of helping address that problem.”

Others agree. “AI/ML is a hot topic, but what markets does it change and shake up that people hadn’t previously thought about? EDA is a great example,” said Steve Roddy, vice president of marketing at Quadric. “The core of classic synthesis/place-and-route is making a transformation from one abstraction and lowering to the next. Historically that was done with heuristics, compiler creators and generators. Suddenly, if you can use a machine learning algorithm to speed that up or get better results, you’ve completely disturbed an existing industry. Does the emergence of machine learning shake up some existing silicon platform? Will my laptop continue to have quad-core processors, or will it suddenly have machine learning processors doing a bunch of the work on a regular basis? Graphics has been a constant race to have higher graphics generation for crisper resolution on phones and TVs, but people are increasingly talking about deploying machine learning upscaling. So you render something with a GPU at a much lower resolution and upscale it with a machine learning algorithm. Then it’s no longer how many GPUs can you pack into a cell phone and stay within a power budget. It’s, ‘Let me go back five generations and have a much smaller, more power-efficient GPU and upscale it, because maybe the human eye can’t see it.’ Or, depending upon lighting and time of day, you upscale it in a different fashion. Those kinds of things throw off the calculus of what’s the standard.”
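A back-of-the-envelope version of that render-low-and-upscale argument is easy to sketch. The per-pixel costs below are invented placeholders; the point is only that shading far fewer pixels plus a relatively cheap upscaling pass can undercut native-resolution rendering.

```python
# Back-of-the-envelope comparison of native rendering vs. render-low-and-upscale.
# All costs are illustrative, expressed in relative "work units" per frame.

native_w, native_h = 3840, 2160          # target display resolution
render_scale = 0.5                       # render at half resolution in each dimension

cost_per_pixel_render = 1.0              # arbitrary GPU shading cost per pixel
cost_per_pixel_upscale = 0.15            # assumed ML upscaler cost per output pixel

native_cost = native_w * native_h * cost_per_pixel_render
low_res_cost = (native_w * render_scale) * (native_h * render_scale) * cost_per_pixel_render
upscale_cost = native_w * native_h * cost_per_pixel_upscale

print(f"native render:        {native_cost / 1e6:.1f} Munits")
print(f"low-res + ML upscale: {(low_res_cost + upscale_cost) / 1e6:.1f} Munits")
```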

This could become particularly useful for speeding up complex modeling of designs, especially where there are a lot of different compute elements on the same die or in the same package. “If you take too many dependencies into your model, it takes more time to simulate them than the reality,” said Roland Jancke, design methodology head in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “Then you’ve over-engineered the model. But it’s always the question of modeling to be as abstract as possible and as accurate as needed. We’ve suggested for a number of years to have a multi-level approach, so you have models at different levels of abstraction, and where you want to really investigate you go deeper into more details.”
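A toy sketch of "as abstract as possible, as accurate as needed" might mix model fidelities in a single simulation, keeping a cheap behavioral model for most blocks and swapping in a detailed model only for the block under investigation. The amplifier models and numbers here are invented for illustration.

```python
# Toy multi-level simulation: cheap behavioral models for most blocks,
# a detailed model only where we want to investigate. Models are invented.

def behavioral_amp(vin, gain=10.0):
    """Fast abstract model: ideal linear amplifier."""
    return gain * vin

def detailed_amp(vin, gain=10.0, vdd=1.2):
    """Slower detailed model: adds supply clipping the abstract model ignores."""
    return max(-vdd, min(vdd, gain * vin))

def simulate(chain, vin):
    """Run a signal through a chain of (name, model) stages."""
    v = vin
    for _, model in chain:
        v = model(v)
    return v

# Investigate only the second stage in detail; keep the rest abstract.
chain = [("stage1", behavioral_amp), ("stage2", detailed_amp), ("stage3", behavioral_amp)]
print(f"output: {simulate(chain, vin=0.05):.3f} V")
```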

AI may help significantly because of its ability to correlate data, and that in turn should bolster the market for AI, because the design process can be automated both for developing AI chips and for the chips themselves.

“The AI chip community is about $20 billion to $30 billion in revenue today, and that is expected to grow to $100 billion by the end of the decade,” said Synopsys’ Krishnamoorthy. “[On the EDA side], it’s about how to optimize designs to get better PPA and to achieve expert quality results with engineers who are earlier in their experience. In the case of verification, it’s achieving higher levels of coverage than what they’re achieving with the current methods, because AI autonomously searches a larger space. In the case of test, it is reducing pattern counts that go on the tester, which directly translates to cost of test and time of test. And in the case of custom design, it is automatically migrating analog circuits from 5nm to 3nm, or 8nm to 5nm. In the past, this used to be a manual effort.”

The price of customization
But there also are many variables and unanticipated results, even in the best-designed systems, and they can affect everything from datapath modeling to how MAC functions are partitioned across different processing elements. For example, that partitioning may be perfectly tuned in the fab or packaging house, but as the processing elements age they can fall out of sync, leaving some of them idling and burning power while waiting for others to complete their processing. Likewise, interconnects, memories, and PHYs may degrade over time, creating timing issues. And to make matters worse, almost constant changes in algorithms may have big impacts on overall system performance well beyond the individual MAC elements.
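A toy barrier-synchronization model shows how a single aging, slowed-down processing element drags the whole array: every step takes as long as the slowest element, and the others burn idle power while they wait. The step timings and idle-power figure below are made up.

```python
# Toy illustration of how one slow processing element drags the rest:
# all elements wait at a barrier for the slowest one, idling and burning
# static power. Timings and power numbers are invented.

def step_time_and_idle(pe_times_us, idle_power_mw=50.0):
    """Barrier-synchronized step: latency is the max, idle energy is the slack."""
    step_us = max(pe_times_us)
    idle_energy_uj = sum((step_us - t) * idle_power_mw / 1000.0 for t in pe_times_us)
    return step_us, idle_energy_uj

balanced = [100.0] * 8                    # all PEs finish together
aged = [100.0] * 7 + [130.0]              # one PE has slowed ~30% over its lifetime

for label, times in (("balanced", balanced), ("aged", aged)):
    step, idle = step_time_and_idle(times)
    print(f"{label:9s}: step {step:.0f} us, wasted idle energy {idle:.1f} uJ")
```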

Over the past decade, many of these issues have been dealt with inside of large systems companies, which increasingly are designing their own chips for internal use. That is changing as more computing moves to the edge, where power consumption has a direct effect on how many miles a vehicle will drive per charge or how useful a wearable device will be if it’s doing more than the most basic operations.

The key here is understanding how much AI to incorporate into these designs, and what exactly that AI is supposed to do. An efficient SoC typically turns on and off various components, using processing cores that may be dark or “warm” on an as-needed basis. But an efficient AI architecture keeps many processing elements running at maximum speed as it decomposes multiply/accumulate computations into parallel operations and then collects the results. If computation in any one of those elements is delayed, it wastes time and power. Done right, this can result in blazing fast computations. However, that speed does come at a cost.

One of the problems is that learning isn’t being widely shared across the industry, because many of these leading-edge designs are being developed for internal use at systems companies. That is slowing the knowledge transfer and industry learning that typically happened with each new rev of a processor family, or with consumer products reviewed by users in the market.

Conclusion
While there is plenty of buzz surrounding AI/ML/DL, it’s no longer hype. It’s being used in real applications, with more on the way, and it will only improve in efficiency, performance, and accuracy as design teams figure out what works best and how to apply it to their designs. There will almost certainly be some hiccups and more uncertainty, such as how AI ages over time as it adapts and optimizes systems. But there doesn’t appear to be any doubt that AI is here for the foreseeable future, and that it will continue to get better wherever there are enough resources and interest.

“The real use cases you see today are occurring every day, starting even with voice processing,” said Renesas’ Chittipeddi. “That would not have been possible 10 years ago. What’s changed fundamentally is the ability to apply AI to real use cases. It’s transforming the landscape.”

Related
Sweeping Changes For Leading-Edge Chip Architectures
Large language models and huge data volumes are prompting innovation at every level.
How Chip Engineers Plan To Use AI
Checks, balances, and unknowns for AI/ML in semiconductor design.



2 comments

Roger says:

Thank you for your well-researched article; it poses many pertinent questions for AI and ML.
Where in this puzzle do you see the spiking neural network of Brainchip’s Akida technology? Or are there other options that are more versatile and robust that show promise in finding solutions?
Best wishes, Roger

Ed Sperling says:

Roger, there are two challenges. First, chipmakers need to figure out what exactly they’re going to be using AI for and why. There are lots of parallel MAC processing elements. The bigger issue is how to size them for a realistic use case, rather than just building the biggest, baddest machine. In many cases, that’s like driving a Ferrari in first gear. Second, the AI/ML/DL itself needs to be more efficient. This is a combination of hardware-software co-design, more computing per cycle, more efficient interaction with memory, sparser algorithms, better branch prediction, and only the precision needed for the job. And when it comes to large language models, which are basically long-line searches, all of this needs to be considered.
