New AI Processor Architectures Balance Speed With Efficiency

Hot Chips 24: Large language models ratchet up pressure for sustainable computing and heterogeneous integration; data management becomes key differentiator.


Leading AI system designs are migrating away from building the fastest AI processor possible, adopting a more balanced approach that combines highly specialized, heterogeneous compute elements, faster data movement, and significantly lower power.

Part of this shift revolves around the adoption of chiplets in 2.5D/3.5D packages, which enable greater customization for different workloads and data types, as well as better performance per watt. Alongside that, leading-edge chipmakers used the recent Hot Chips 24 conference to show off novel micro-architectures, improvements in prefetch and branch prediction, faster memory access, and more intelligent data management on and off chip.

Chipmakers are reacting to growing pressure by government agencies and utilities to curb the amount of energy required to process large language models, due to the limitations of today’s power grids. They also are leveraging those heterogeneous architectures to tap into new opportunities in both hyperscale and edge data centers.

These new designs also take aim at NVIDIA’s near monopoly in the AI world, which stems from the proliferation of inexpensive GPUs and the CUDA-based models built atop them. No general-purpose processor is as energy-efficient as a customized accelerator, and rather than relying on a single type of chip, most of the multi-chiplet architectures presented at Hot Chips this year include more than one type of processor, more expansive memory and I/O configurations to limit bottlenecks, and much more efficient data management.

NVIDIA is well-aware of these competitive threats, of course, and the company certainly is not standing still. Its new Blackwell chip combines GPUs with a CPU and DPU, and its quantization scheme opens the door to low-precision AI in addition to its blazing fast training capabilities, which are needed to handle much larger data models.

Fig. 1: AI models have grown by a factor of 70,000 over a decade as new capabilities and parameters are added. Source: NVIDIA/Hot Chips 24

“Unlike general-purpose computing, AI and accelerated computing is a full-stack problem that requires reinventing how we look at software and computation from the ground up,” said Raymond Wong, director of GPU architecture at NVIDIA, in a Hot Chips 24 presentation. “The scale of the problem, and the performance we need, requires a new approach that looks at solutions from the top-down and bottom-up across hardware, system software, acceleration libraries, platform and application frameworks…It is vital that we learn how to optimize the chips, the networking, and the infrastructure.”

NVIDIA still has plenty of expansion opportunities ahead, but going forward it will face more competition on many fronts.

Changes in the data center
One of the big changes in processor designs this year is an intensive focus on data management. With AI, it’s no longer just about building massive arrays of redundant processing elements and running them as fast as possible. Increasingly, the goal is to intelligently prioritize data — there is much more data, and more types of data — but this approach isn’t new. In fact, it dates back to 1980, when Intel introduced its 8087 floating point co-processor. Arm expanded on that concept in 2011 with its big.LITTLE dual-core processor, with the smaller core targeted at less compute-intense jobs and the larger core there when needed.
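As a conceptual illustration of that kind of partitioning, the sketch below routes tasks to a “big” or “little” core pool based on a rough compute-intensity estimate. The task fields and threshold are invented for the example; this is not any vendor’s actual scheduler.

```python
# Hypothetical sketch of big.LITTLE-style task routing: compute-heavy work
# goes to the "big" core pool, lighter work to the "little" pool. The task
# fields and threshold are invented for illustration, not any real scheduler.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    est_gflops: float   # rough estimate of the task's compute demand

BIG_THRESHOLD = 5.0     # arbitrary cutoff for this example

def route(task: Task) -> str:
    """Return which core pool should run the task."""
    return "big" if task.est_gflops >= BIG_THRESHOLD else "little"

tasks = [Task("ui_update", 0.2), Task("matrix_multiply", 42.0), Task("sensor_poll", 0.05)]
for t in tasks:
    print(f"{t.name} -> {route(t)} core")
```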

This strategy has been refined since then through more complex partitioning and prioritization, but it’s not one typically associated with AI chips running in large data centers. That’s where most AI training happens, and it likely will remain there for some time, because developing large language models and repeatedly querying them requires enormous amounts of compute power. Still, not every compute cycle is processing-intensive, and generative AI models do not need to be queried as often as they are today.

Even IBM, which claims to handle 70% of all the financial transactions in the world through its mainframe computers, has shifted its focus away from raw tera operations per second (TOPS) toward performance per watt (in effect, picojoules per operation). This is especially noteworthy because, unlike large systems companies, which now account for an estimated 45% of all leading-edge chip designs, IBM sells its systems to end customers rather than just compute as a service.
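The difference between those two metrics is easy to quantify. A back-of-the-envelope conversion, using made-up figures purely for illustration:

```python
# Back-of-the-envelope conversion from raw throughput (TOPS) and power draw
# to energy per operation. The figures below are placeholders, not vendor specs.
def picojoules_per_op(tops: float, watts: float) -> float:
    ops_per_second = tops * 1e12
    joules_per_op = watts / ops_per_second   # W = J/s, so this is J per operation
    return joules_per_op * 1e12              # convert joules to picojoules

# Two hypothetical accelerators: the "faster" chip is the less efficient one per op.
print(picojoules_per_op(tops=400, watts=700))   # 1.75 pJ/op
print(picojoules_per_op(tops=200, watts=150))   # 0.75 pJ/op
```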

IBM’s new Telum II processor includes a data processing unit (DPU) for I/O acceleration (essentially funneling data to where it will be processed and stored), as well as innovative caching. In all, it contains eight cores running at 5.5 GHz, ten 36MB L2 caches, and a new accelerator chiplet.

“DPUs are being widely used in the industry for handling massive amounts of data very efficiently,” said Chris Berry, IBM distinguished engineer. “Mainframes handle massive amounts of data. One fully configured IBM z16 is able to process 25 billion encrypted transactions each and every day. That’s more encrypted transactions per day than Google searches, Facebook posts, and Tweets combined. That kind of scale requires I/O capabilities well beyond what typical compute systems can deliver. It requires custom I/O protocols to minimize latency, support virtualization for thousands of operating system instances, and can handle tens of thousands of I/O requests at any time.”

The new chip also features a 15% power reduction in the eight-core central compute complex, in part due to better branch prediction. That has been a consistent theme at Hot Chips conferences over the past couple of years, where more accurate branch prediction and faster recovery from prefetch errors have been shown to improve performance. But the addition of a DPU takes that a step further, acting as a smart traffic cop for data. Berry noted the DPU is mounted directly on the processor chip and can reduce the amount of power needed for I/O management by 70%. Add to that a 2MB scratchpad for each of the 32 cores in its accelerator chip, which he described as “simply a place to put data for later use,” and the improvements in performance per watt are notable.
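A toy model makes the scratchpad’s contribution concrete: data staged locally for reuse doesn’t have to be re-fetched from off-chip memory. The access pattern and sizes below are invented for illustration only.

```python
# Toy model of a per-core scratchpad: data staged locally for later reuse does
# not have to be re-fetched from off-chip memory. Sizes and the access trace
# are invented purely for illustration.
SCRATCHPAD_BYTES = 2 * 1024 * 1024   # 2MB per core, as described above

def offchip_traffic(accesses, use_scratchpad):
    """Count bytes fetched off-chip for a stream of (tag, size) accesses."""
    fetched, resident, used = 0, set(), 0
    for tag, size in accesses:
        if use_scratchpad and tag in resident:
            continue                             # served from the scratchpad
        fetched += size
        if use_scratchpad and used + size <= SCRATCHPAD_BYTES:
            resident.add(tag)                    # stage the data for later use
            used += size
    return fetched

trace = [("weights_a", 1_000_000), ("acts_0", 500_000), ("weights_a", 1_000_000)] * 4
print(offchip_traffic(trace, use_scratchpad=False))  # 10,000,000 bytes
print(offchip_traffic(trace, use_scratchpad=True))   #  1,500,000 bytes
```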

Fig. 2: IBM’s new Spyre accelerator architecture. Source: IBM/Hot Chips 24

Intel likewise introduced its next-generation accelerator chip for AI training, Gaudi 3, which features four deep learning cores (DCOREs), eight HBM2e stacks, and a matrix multiplication engine that is configurable rather than programmable. Add to that 64 tensor processing cores, a memory subsystem with a unified memory space spanning L2 cache, L3 cache, and HBM, near-memory compute, and an integrated software suite that lets customers plug in custom TPC kernels and supports PyTorch. The chip also connects two compute dies over an interposer bridge.

Intel’s approach to data management within the device is similar in concept to IBM’s. Intel uses a Sync Manager to dispatch work to designated units, and a Runtime Driver that sets work dependencies by configuring the Sync Manager. This approach maximizes resource utilization within the system, and it avoids any bottlenecks by asynchronously passing events using an interrupt manager.
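Conceptually, that dependency-driven dispatch can be modeled as a small event graph: each unit of work declares what it waits on and signals its own completion, and a manager releases work as prerequisites finish. The sketch below illustrates the idea generically; it is not Intel’s actual runtime driver or Sync Manager interface.

```python
# Generic model of dependency-driven dispatch: each work item waits on named
# events and signals its own event when it completes, so independent engines
# can start as soon as their inputs are ready. This illustrates the concept
# only; it is not the actual Gaudi runtime or Sync Manager interface.
class SyncManagerModel:
    def __init__(self):
        self.waiting = []        # list of (name, prerequisite set, action)
        self.done = set()        # names of completed work items

    def submit(self, name, deps, action):
        self.waiting.append((name, set(deps), action))

    def run(self):
        progress = True
        while self.waiting and progress:
            progress = False
            for item in list(self.waiting):
                name, deps, action = item
                if deps <= self.done:            # all prerequisites signaled
                    action()
                    self.done.add(name)          # signal this item's completion
                    self.waiting.remove(item)
                    progress = True

mgr = SyncManagerModel()
mgr.submit("matmul", deps=["dma_in"], action=lambda: print("MME: matrix multiply"))
mgr.submit("dma_in", deps=[], action=lambda: print("DMA: load tiles"))
mgr.submit("activation", deps=["matmul"], action=lambda: print("TPC: activation"))
mgr.run()   # runs DMA, then MME, then TPC, regardless of submission order
```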

“Each of the deep learning cores contains 2 MMEs (matrix multiplication engines), 16 tensor processing cores, and 24 megabytes of cache,” said Roman Kaplan, principal AI performance architect at Intel. “The main workhorse of the chip is the MME. It does all the operations that can be translated to matrix multiplication. This is a configurable — not a programmable —engine, meaning you don’t have any piece of code that’s running on this engine. There is a fixed set of registers that controls the operation of this unit, and according to these values, the unit just works. Each MME unit is basically a large output stationary systolic array.”
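For readers unfamiliar with the term, an output-stationary systolic array keeps each partial sum pinned to one processing element while operands stream past it. The minimal cycle-level model below shows that dataflow; the array dimensions and skewing are generic, not the actual MME configuration.

```python
# Minimal cycle-level model of an output-stationary systolic array computing
# C = A @ B. Each PE(i, j) holds its own accumulator in place; A streams in
# from the left and B from the top with a one-cycle skew per row/column, so
# A[i][k] meets B[k][j] at PE(i, j) on cycle k + i + j. Array dimensions here
# are arbitrary and do not reflect the real MME.
def systolic_matmul(A, B):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]              # outputs stay resident in the PEs
    for t in range(M + N + K - 2):               # total pipeline cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j                    # operand pair arriving this cycle
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j] # multiply-accumulate in place
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```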

Fig. 3: Block diagram of Intel’s Gaudi 3 AI accelerator. Source: Intel/Hot Chips 24

AMD’s MI300X chip, which is aimed at AI systems, is built from 12 chiplets (4 I/O dies and 8 accelerator dies), and likewise funnels data to where it is best processed. “Generative AI model performance requires memory capacity and bandwidth,” said Alan Smith, senior fellow and Instinct lead SoC architect at AMD. “So we targeted the scale of MI300X to meet this need, with a level of integration not possible with a monolithic design. [With] its fourth-generation Infinity Fabric, PCI Express Gen 5, HBM3, and CDNA 3 architecture, MI300X provides balanced scaling across compute, memory, and I/O subsystems.”

Fig. 4: AMD’s MI300X chiplet-based AI chip. Source: AMD/Hot Chips 24

Changes at the edge
In the past, the AI processing world was largely split between training in hyperscale data centers and inferencing in much smaller, often mobile devices. Increasingly, both training and inferencing are moving to the edge because of the cost of moving large amounts of data and the time it takes to get results from queries. While LLMs continue to expand, they’re not the only AI models being trained. Smaller, more domain-specific models can be trained using less intensive compute infrastructure, and inferencing can be done on devices with a battery.

This opens up a whole new market for heterogeneous designs using chiplets, not all of which will be developed by the same company or foundry. HBM is the first big success in this direction, but chiplets are being designed across a swath of different applications, similar to how soft IP has been used for the past couple decades. As with data center AI chips, managing data movement and memories is key.

Intel’s Lunar Lake is the chipmaker’s answer for mobile and desktop computing. Arik Gihon, CPU SoC hardware architecture lead at Intel, cited four main targets for the chip: energy efficiency, core performance, improved graphics, and overall performance, with up to 120 platform TOPS. Intel’s approach is to segregate the logic, using a compute tile and a platform controller tile in a 2.5D configuration with memory on package.

“Lunar Lake is built off of two main tiles,” said Gihon. “One is a compute tile, and the second is a platform controller die. The compute tile is manufactured on TSMC N3B, and the platform controller die is on TSMC N6. The base die is on Intel’s 1227, and they are connected with Foveros technologies. We also introduced memory-on-package, which has two big advantages. One is power. It enables us to optimize the PHY specifically for low power, due to a low interconnect, as well as an optimization specifically for memory-on-package. The second is the footprint on the motherboard, which goes down by up to 250mm2.”

Qualcomm, meanwhile, developed its custom Oryon SoC along the same general lines. It includes three CPU clusters, each with four cores. Two of those are geared toward performance and one toward energy efficiency. What stands out in many of these presentations is the micro-architecture, which is basically how instructions are executed on the hardware. As with much larger systems, how and where data is processed and stored is central to many of these designs.

“[Oryon] has eight fundamental decoders in it, and they are preparing instructions for the execution units, the load store unit, and the vector execution unit,” said Gerard Williams, senior vice president of engineering at Qualcomm. “The instructions themselves enter into a re-order buffer. It’s around 600 entries, which gives you an idea about how many instructions the machine is going to be managing in flight. And from a retirement point of view, there are eight instructions per cycle this machine can retire.”

Of particular note in Qualcomm’s chip is the memory management unit. “It’s backed by a very large unified Level 2 translation buffer, and this is done primarily to handle a large data footprint,” Williams said. “It’s meant to handle all of the virtualized structures, the security layers, but this structure is much larger than 8 kilo-entries, which is very non-typical. It’s meant to keep the latency of translation down to an absolute minimum.”
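To put that in perspective, a translation buffer’s “reach” is simply its entry count times the page size. A quick calculation, assuming standard 4KB pages for the sake of the example:

```python
# Rough "reach" arithmetic for a translation buffer: the amount of memory it
# can map without a page-table walk is entries x page size. The 4KB page size
# is an assumption for this example; larger pages scale the reach accordingly.
ENTRIES = 8 * 1024        # "8 kilo-entries," as described above
PAGE_SIZE = 4 * 1024      # assumed 4KB pages

reach_mb = ENTRIES * PAGE_SIZE / (1024 ** 2)
print(f"{reach_mb:.0f} MB of address space mapped without a walk")   # 32 MB
```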

Fig. 5: Qualcomm’s Oryon SoC schematic, with dual emphasis on efficiency and speed. Source: Qualcomm/Hot Chips 24

Many of the presentations at the conference were from familiar names, but there were some newcomers, as well. Consider FuriosaAI, for example, a South Korea-based startup developing AI chips for the edge. The company developed what it calls a tensor contraction processor for sustainable AI computing.

“While training is all about AI models, inference is about services and deployment,” said June Paik, co-founder and CEO. “Our original design in 2021 was optimized for BERT-scale models. However, we soon made a major shift in our design with the emergence of foundation models such as GPT-3, a model five times larger than BERT. Even though those gigantic models were primarily in the research phase, we bet on their potential. We believe people will gravitate toward the most capable AI models because more intelligent models offer greater benefits. We also believe that efficiency will become even more critical as we scale these applications. So we set out to make RNGD [pronounced Renegade] the most efficient inferencing chip.”

Moving data quickly back and forth to memory is central to the RNGD architecture, which is aimed at edge data centers. The company claims 1.5 TB/s of memory bandwidth from two HBM3 stacks with 48GB of total capacity, backed by 256MB of on-chip SRAM.
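For memory-bound workloads such as LLM token generation, those numbers set a hard ceiling on throughput, since each generated token requires streaming the model’s weights from memory. A rough upper-bound estimate, with a hypothetical model size:

```python
# Rough upper bound on decode throughput for a memory-bandwidth-bound model:
# each generated token requires streaming the full weight set from memory, so
# tokens/s <= bandwidth / model size. The model size below is hypothetical.
BANDWIDTH_BYTES_PER_S = 1.5e12   # 1.5 TB/s, as claimed for RNGD
MODEL_BYTES = 14e9               # e.g., a hypothetical 7B-parameter model in FP16

max_tokens_per_s = BANDWIDTH_BYTES_PER_S / MODEL_BYTES
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound per stream")   # ~107
```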

Fig. 6: Furiosa’s sustainable edge inferencing chip architecture. Source: Furiosa/Hot Chips 24

Conclusion
AI is just becoming useful, but there are still challenges ahead. First and foremost, AI has to be sustainable, a message well understood by large chip companies as evidenced by chip architectures presented at Hot Chips 24. But chips are only one piece of the solution.

Sustainability also requires more efficient software, micro-architecture improvements that reduce how often large language models must be queried, and increasingly precise results so that LLM responses can be trusted. In addition, it will require tighter integration of specialized processing elements in the form of chiplets, which can process different data types faster and more efficiently.

The bottom line: AI isn’t going away, but the efforts of the entire semiconductor ecosystem will be required for it to reach full potential.

Related Reading
Architecting Chips For High-Performance Computing
Data center IC designs are evolving, based on workloads, but making the tradeoffs for those workloads is not always straightforward.
When To Expect Domain-Specific AI Chips
With the intended application evolving faster than silicon can be developed, optimizing hardware becomes a delicate balance.
Chiplets: 2023 (EBook)
What chiplets are, what they are being used for today, and what they will be used for in the future.


