OPINION

What Do LLMs Want from Hardware

Simple: More, more, more of everything…



Figure 1: Noam Shazeer, Google Gemini vice president, presented this in his Hot Chips 2025 talk.

Noam Shazeer is Google’s vice president of engineering for Gemini, its LLM competitor to ChatGPT. He spoke recently at Hot Chips on “Predictions for the Next Phase of AI.” He has worked on LLMs for about a decade and co-invented the transformer model in 2017. As his slide says, LLMs can take advantage of more of everything from hardware to improve performance and accuracy.


Figure 2: Jensen Huang GTC 2025 Keynote

It was just six months ago that Jensen Huang talked about data center CapEx reaching $1 trillion or more by 2028. On Nvidia’s recent earnings call, Huang said he now sees $3 trillion to $4 trillion of AI infrastructure spending over the next five years! These are colossal growth rates for an already huge market. This is the gold rush of our lifetimes. [Note: In Figure 2, data center CapEx is larger than AI semiconductor spending because it also includes land, buildings, power, racks, and more.]

Let’s look deeper to understand why LLMs are ravenous for compute and connectivity and the options that are appearing to deliver more, more, more. There is no shortage of new ideas.

LLMs are Driving Data Center Growth

ChatGPT, Claude, Gemini, Llama, and other LLMs are behind the incredible ramp of data center CapEx. These are called frontier models because they deliver the best results. Annual recurring revenue (ARR) is growing exponentially. OpenAI’s ARR was $5B/year at the start of 2025 and doubled to >$10B/year by mid-2025. Anthropic’s ARR grew 5x, from $1B/year at the start of 2025 to $5B/year by mid-2025.

Gemini (Google) and Llama (Meta) are growing rapidly, too. These models use a huge number of parameters. Their context windows keep growing (the context window is the amount of text, in tokens, that a model can “remember” at any one time, which limits the size of documents or code that can be processed). Their KV Cache demands keep growing as well (tokens are generated one at a time; the KV Cache stores and reuses the key and value vectors from previous tokens rather than recomputing them for each new token). Deep Research mode has a model “think longer” to refine its answer and has other models check the initial results, yielding a more thorough and accurate analysis. All of this drives the demand for more hardware.
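To get a feel for why growing context windows translate into memory demand, here is a rough back-of-the-envelope sketch of KV Cache size. The model shape below (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical example chosen for easy arithmetic, not the configuration of any model named in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Each layer stores one key vector and one value vector per token per
    KV head, hence the leading factor of 2.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
example = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for context_tokens in (8_192, 131_072):
    size_gib = kv_cache_bytes(seq_len=context_tokens, **example) / 2**30
    print(f"{context_tokens:>7} tokens -> {size_gib:.1f} GiB per sequence")
```

Under these assumptions the cache grows linearly with context length, and it is needed per concurrent sequence, on top of the model weights.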

Despite this tremendous ramp in complexity and compute, the cost per query is going down, which drives the demand for more.

LLMs are easy to use casually, but getting the most out of them isn’t trivial. Prompt engineering is a new field whose experts get the best results out of these models most efficiently.

A recent article reported that >90% of big-business AI experiments fail. This doesn’t mean AI doesn’t work. It means there is a learning curve, and some companies are figuring it out faster than others. The competitive advantage will shift to the fast learners. Salesforce just announced a cut of 4,000 customer service roles because AI agents can do their work. Some other companies say AI tools will increase productivity and reduce the need for much future hiring.

Training LLMs has very different hardware requirements than inference. For example, in training there are far more GPUs involved, often spanning multiple data centers, and the “all-gather” cycle means that thousands of GPUs sit idle waiting for the last GPU to report.

Networking is critical for training. Inference involves fewer GPUs per job and many more models running at a time. It used to be that training consumed most data center resources, but now that ChatGPT and others have soaring demand, inference workloads will dominate going forward. A split of roughly 80% inference to 20% training seems likely in future years.

Not all LLMs are Frontier. There are many companies building their own models for specialized purposes. For example, if you are a company like Bosch with many kinds of appliances and call centers fielding questions, you can train a model on all of your public and internal documentation so your call center people can get to the right answer faster. Why pay for a Frontier Model that knows Shakespeare and Chinese when you can get a smaller, cheaper one that knows just what you want?

“Build an LLM from Scratch,” by Sebastian Raschka is a good book that dives into LLM details. You can get it on Amazon. I’m slogging through it now to understand better how LLM architecture drives hardware needs.

I recently heard of a new type of large language model: the diffusion LLM. Mercury Coder claims 5 to 10x better performance (tokens/second). A venture capitalist I met recently told me that AMD’s initial success in GPU sales has something to do with this, as these models don’t need as many GPUs, so AMD’s current disadvantage in large scale-up isn’t an issue. You can learn more by Googling “What is Diffusion LLM and why it matters” by Zheng “Bruce” Li.

More PetaFlops (PFlop = 1 quadrillion floating-point operations/second)

This is the part of AI hardware that most people are most aware of — huge compute provided by Nvidia GPUs and now AMD GPUs, as well as custom accelerators made by the hyperscalers.

On Nvidia’s recent earnings call, it was disclosed that more than half of its data center revenue comes from three companies, probably Amazon AWS, Google Cloud, and Microsoft. Each of these companies is buying more than $10 billion/year of Nvidia GPUs. They can afford to build their own custom accelerators (even if that costs $500 million-ish/year).

Hyperscalers (and OpenAI) are building their own custom accelerators for two reasons:

  1. They can use ASIC companies with lower margins than Nvidia to cut costs and get leverage in negotiations, and
  2. They know their LLM models and their needs and can optimize their hardware.

Hyperscalers run a lot of customer workloads written for PyTorch, which today maps only to Nvidia and, recently, AMD GPUs. Even with a great custom accelerator, they need to buy GPUs for these workloads, but they will be incentivized to give business to AMD where they can, to create GPU price competition.

There are some brave startups getting funded, like d-Matrix and Positron, to build data center AI compute optimized for niche markets such as local, small LLMs.

More Memory Capacity & Bandwidth (all levels of hierarchy)

When you see a “die photo” of an AI accelerator, you always see the GPU chiplets with HBM (high-bandwidth memory) stacks on at least two sides. HBM-to-GPU connections are very wide and very fast. Without HBM, GPUs would starve for data, which is why buyers pay ~10x more per bit for HBM than for DDR DRAM. HBMs are an engineering marvel. They’re already at 16-high stacks! And HBMs keep increasing the bandwidth they deliver through a combination of more connections and higher data rates per connection.

As a memory guy pointed out to me, there are more memory transistors on an accelerator than GPU transistors (look at the respective areas and remember there are up to 16 dies per HBM stack, so the total memory silicon area is larger than the logic area). Integrating HBM with GPU on a silicon interposer is what initially drove TSMC’s multi-chip packaging. It might seem hard for HBM to keep increasing capacity and bandwidth given how much has already been done, but the money involved here is huge, so I expect we’ll see further innovations.

Interestingly, rumor is that OpenAI is going to use 8-high HBM for inference. Bandwidth is more important than capacity, and 8-high has a better bandwidth per capacity.

There is so much memory demand — billions of weights, growing context windows, growing KV Cache sizes — that a memory hierarchy has developed where the most often used KVs are in HBM. Others are in more distant memory based on relative demand. This is reminiscent of L1/L2/L3 caching for CPUs. This is what Nvidia’s Dynamo Distributed KV Cache Manager does, allocating KVs to HBM, DRAM, or NVMe. Smart allocation boosts tokens/$ significantly.
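The tiering idea can be illustrated with a minimal sketch: rank KV blocks by how often they are used and fill the fastest tier first. This is a simplified illustration under my own assumptions (the function name, block ids, and capacities are hypothetical), not how Nvidia’s Dynamo actually makes its placement decisions.

```python
def place_kv_blocks(access_counts, tier_capacity):
    """Assign KV blocks to tiers, hottest blocks to the fastest tier first.

    access_counts: dict of block id -> recent access count.
    tier_capacity: list of (tier name, number of blocks it can hold),
                   ordered from fastest to slowest.
    """
    # Sort by descending access count; ties broken by block id for determinism.
    ranked = sorted(access_counts, key=lambda b: (-access_counts[b], b))
    placement = {}
    start = 0
    for tier, capacity in tier_capacity:
        for block in ranked[start:start + capacity]:
            placement[block] = tier
        start += capacity
    # Blocks that fit in no tier are left out and would be recomputed on demand.
    return placement


# Hypothetical access counts and tier sizes (in blocks).
counts = {"a": 90, "b": 5, "c": 40, "d": 1, "e": 12}
tiers = [("HBM", 1), ("DRAM", 2), ("NVMe", 2)]
print(place_kv_blocks(counts, tiers))
```

A real manager would also have to weigh transfer cost against recomputation cost and react as demand shifts; the sketch only shows the static ranking step.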

DRAM today is attached to the CPU that the GPU is connected to (often two GPUs per CPU). The data transfer rate between the CPU DRAM and the GPU is relatively slow over PCIe. Eliyan is proposing having the back side of a custom HBM connect to LPDDR (lower power than DDR) to provide much higher bandwidth for DRAM into the accelerator.


Figure 3: Eliyan.com

Finally, I’ve been hearing the term “memory appliance” in recent months. The idea is to build a big pool of memory using DRAM, much cheaper than HBM, with a high-bandwidth connection to the GPU pod.

Enfabrica recently announced its memory fabric system for LLM inference. It connects to the CPU CXL interface with 400/800 gigabit/second data transfer and offers up to 18 terabytes of DDR5 DRAM per node.


Figure 4: Enfabrica.net EMFASYS Memory Fabric System for AI Inference

More Network Bandwidth (all levels of hierarchy)

There are multiple networks in an AI Data Center — scale up, scale out, and at Hot Chips I heard about Nvidia’s scale-across network.

In the “old days,” like 5 years ago, Ethernet connected everything in the data center. Every slot in the rack would attach to the TOR (top of rack router/switch), which would connect to all the other TORs in a row, and then a higher-level layer of switches.

Networking innovation is rampant now because running Frontier LLMs on GPUs requires very fast, very-high-bandwidth transfers across hundreds or thousands of GPUs.

Google presented at Hot Interconnects on the networking challenges of training and its solutions. Training requires thousands of accelerators to work together. The work is divided across the accelerators, but periodically all accelerators need to share results to synchronize weights. The last accelerator to respond keeps all the others waiting. This is called tail latency at the 100th percentile. An ideal training network is scheduled and predictable. Google’s Firefly acts as a universal metronome, providing a sub-10 nanosecond clock synchronized across the entire data center!
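The cost of a single straggler is easy to see in a small sketch. The timings below are hypothetical, and the model is deliberately simple: a synchronous step takes as long as its slowest participant.

```python
def sync_step_time(worker_times):
    """A synchronous step ends only when the slowest worker finishes."""
    return max(worker_times)


def idle_fraction(worker_times):
    """Share of total accelerator-time spent waiting for the slowest worker."""
    step = sync_step_time(worker_times)
    waiting = sum(step - t for t in worker_times)
    return waiting / (step * len(worker_times))


# Hypothetical timings in milliseconds: three workers on pace, one straggler.
times_ms = [10.0, 10.0, 10.0, 20.0]
print(sync_step_time(times_ms))   # step time set by the straggler
print(idle_fraction(times_ms))    # fraction of accelerator-time wasted
```

With thousands of accelerators, the chance that at least one is slow on any given step goes up, which is why the 100th percentile matters more than the average.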


Figure 5: Google Cloud slide from Hot Interconnects, August 2025

Another Google innovation is Falcon, which is incorporated in the Intel SmartNIC E2100. It enables a “timing wheel,” which paces packet injection into the network to reduce congestion, like the metering lights on freeway on-ramps in large cities.
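For readers unfamiliar with the data structure, here is a minimal sketch of a generic timing wheel used for pacing. It is my own illustration, not Falcon’s implementation; the class name, slot size, and packet sizes are all assumptions. Each packet is given a release time based on the flow’s target rate and dropped into the matching time slot, and the sender releases one slot per tick.

```python
from collections import deque


class TimingWheel:
    """Circular array of time slots; each slot holds packets due in that slot."""

    def __init__(self, num_slots, slot_ns):
        self.slots = [deque() for _ in range(num_slots)]
        self.slot_ns = slot_ns
        self.now_slot = 0  # absolute index of the slot for the current time

    def schedule(self, packet, send_time_ns):
        slot = max(send_time_ns // self.slot_ns, self.now_slot)
        # Clamp to the wheel's horizon so entries never wrap past "now".
        slot = min(slot, self.now_slot + len(self.slots) - 1)
        self.slots[slot % len(self.slots)].append(packet)

    def advance(self):
        """Release every packet due in the current slot, then move one tick."""
        bucket = self.slots[self.now_slot % len(self.slots)]
        released = list(bucket)
        bucket.clear()
        self.now_slot += 1
        return released


def pace_flow(wheel, packets, bytes_per_ns, start_ns=0):
    """Space packets so the flow does not exceed its target rate."""
    t = start_ns
    for name, size_bytes in packets:
        wheel.schedule(name, t)
        t += int(size_bytes / bytes_per_ns)


# Hypothetical flow: four 1000-byte packets paced at 1 byte/ns.
wheel = TimingWheel(num_slots=8, slot_ns=1000)
pace_flow(wheel, [(f"p{i}", 1000) for i in range(4)], bytes_per_ns=1.0)
released = [wheel.advance() for _ in range(4)]
print(released)
```

Instead of four packets hitting the network back to back, one leaves per tick, which is the metering effect described above.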

In Ethernet, the king of the hill is Broadcom’s Tomahawk, which is used in most switches like market leader Arista’s. At Hot Chips, Broadcom said the Tomahawk Ultra is now shipping with 512 ports of 100G-PAM4 each. Tomahawk Ultra will enable faster Ultra Ethernet switches for scale-out networks.

Broadcom is also promoting Tomahawk Ultra for scale-up networking (SUE = scale-up Ethernet). NVLink only works with Nvidia, although Nvidia has indicated a willingness to let other AI accelerators connect using its proprietary interface; so far, no one has announced plans to do so. Tomahawk Ultra SUE is the only scale-up solution today for non-Nvidia players. Tomahawk SUE added Link Layer Retry. Previously, dropped packets were handled higher up the stack, with much longer latencies. Link Layer Retry is in UALink, and probably NVLink. Also added was credit-based flow control, like in the UALink spec. Several other features for scale-up have been added. Latency is higher than UALink, but not by much (250ns vs 200ns), at least per the slides. Broadcom implied it has customers that are designing pods of 1K or even 2K GPUs, and that 2 layers of switches are being considered.
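Credit-based flow control is a general technique, and the basic mechanism fits in a few lines. This is a generic sketch with hypothetical names, not the Tomahawk Ultra or UALink implementation: the sender holds one credit per free receiver buffer slot and simply stalls at zero, so the receiver’s buffer never overflows and nothing has to be dropped for lack of space.

```python
from collections import deque


class CreditLink:
    """Sender may transmit only while it holds credits from the receiver."""

    def __init__(self, receiver_buffer_slots):
        self.credits = receiver_buffer_slots   # one credit per free buffer slot
        self.receiver_buffer = deque()

    def try_send(self, packet):
        if self.credits == 0:
            return False          # sender stalls; nothing is dropped
        self.credits -= 1
        self.receiver_buffer.append(packet)
        return True

    def receiver_consume(self):
        """Receiver drains one packet and returns a credit to the sender."""
        packet = self.receiver_buffer.popleft()
        self.credits += 1
        return packet


# Hypothetical link with room for two packets at the receiver.
link = CreditLink(receiver_buffer_slots=2)
results = [link.try_send(p) for p in ("p0", "p1", "p2")]
link.receiver_consume()            # frees a slot and returns a credit
retry = link.try_send("p2")
print(results, retry)
```

The third packet waits at the sender until a credit comes back, rather than being dropped and recovered higher up the stack.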

Other switch players focusing on scale-up are Marvell, Astera Labs, and startups like Xconn.

UALink is the scale-up interconnect being promoted by AMD and others for the non-Nvidia ecosystem. The UALink spec is very detailed, with several hundred pages. Implementations are underway at multiple companies. No one has announced general availability yet. UALink is designed to connect up to 1,024 GPUs (whether that is achievable depends on the interconnect medium; copper can’t do it).

There is also a market for custom scale-up interconnect for hyperscaler accelerators. Rumors are that AWS wants switches optimized for the scale-up strategy of its Trainium accelerators.

Huawei also presented at Hot Chips. Under current U.S. regulations they are cut off from the newest Nvidia technology. They proposed a unified bus built on Ethernet that eliminates protocol conversion, resulting in lower network latency.

Google has a very different networking approach for their TPUs. At Hot Chips they described their new Ironwood TPU and its interconnect.


Figure 6: Google Ironwood TPU presentation at Hot Chips August 2025

Google TPUs have used a hypercube interconnect from the start. Each TPU has 6 high-speed interconnects. In the simplest hypercube, TPUs are connected in a 2 x 2 x 2 cube. Each TPU connects to adjacent TPUs in the X, Y, and Z dimensions. In a 2 x 2 x 2 cube, each TPU connects directly to its three neighbors, one per dimension, and reaches the rest in two or three hops. In larger hypercubes there are more hops between distant TPUs. TPU configurations can be dynamically adjusted for model size up to 8,192 TPUs. Now, as shown above, the inter-rack connections are pluggable optics, and an optical switch is added for connection to a very large memory pool shared across all TPUs.
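Hop counts in this kind of topology can be estimated with a short sketch. It assumes a 3D grid in which each dimension wraps around (a torus); the wraparound and the cube sizes used below are my assumptions for illustration, not details taken from Google’s presentation.

```python
from itertools import product


def torus_hops(a, b, dims):
    """Minimum hops between coordinates a and b on a grid with wraparound."""
    return sum(min(abs(x - y), n - abs(x - y)) for x, y, n in zip(a, b, dims))


def max_hops(dims):
    """Worst-case hop count from the origin (every node looks the same)."""
    origin = (0,) * len(dims)
    nodes = product(*(range(n) for n in dims))
    return max(torus_hops(origin, node, dims) for node in nodes)


def direct_neighbors(dims):
    """Number of distinct nodes exactly one hop from the origin."""
    origin = (0,) * len(dims)
    nodes = product(*(range(n) for n in dims))
    return sum(1 for node in nodes if torus_hops(origin, node, dims) == 1)


for dims in ((2, 2, 2), (4, 4, 4), (16, 16, 16)):
    print(dims, "neighbors:", direct_neighbors(dims), "max hops:", max_hops(dims))
```

Once each dimension has at least three nodes, every chip has six distinct neighbors, which matches the six links per TPU, and the worst-case hop count grows with the size of each dimension.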

Scale-across networking is a new term I first heard in Nvidia’s Hot Chips presentation, “Co-Packaged Silicon Photonics Switches for Gigawatt AI Factories.” This is the first deployment of co-packaged optics (CPO). The motivation for CPO, as they explained, is to dramatically lower power compared to pluggable optics. Power is a key limiter in data centers. Every watt saved means more GPUs can be installed within a given data center power budget. Another advantage of optics is that switches in data centers separated by many kilometers can interoperate. This is often required for training increasingly large Frontier LLMs.


Figure 7: Nvidia Spectrum XGS Ethernet switches – Hot Chips August 2025

They presented data comparing their new switch with off-the-shelf Ethernet switches, showing double the bandwidth between sites 10 km apart, especially for very large message sizes.

In summary, LLMs will drive many years of growth and hardware innovation

This is the Gold Rush for Semiconductor and Systems companies that can innovate and provide the solutions for more compute, more memory, more bandwidth (at competitive costs and power). The market growth will continue to be staggering for the rest of the decade, at least. Winners can grow fast and very big.


