SPONSOR BLOG

AI Inference Needs A Mix-And-Match Memory Strategy

Matching memory technology to the inference workload phase is necessary to achieve the lowest cost per served token.

February 12th, 2026 - By: Raj Uppala

AI inference is no longer a single workload that can be served efficiently by a single type of accelerator or memory. From fast chat replies to 10M-token codebases, inference spans widely diverse workloads with very different limits on latency, bandwidth, capacity, and compute, as the figure below demonstrates.¹

Source: Meta¹

The AI inference spectrum of workloads includes:

Interactive LLMs (chat, copilots, agents) with strict latency targets
Long‑context reasoning (codebases, research, video) with massive key-value (KV) cache footprints
Ranking and recommendation models with enormous embedding tables
Batch and offline inference where throughput and cost dominate over latency

These workloads stress hardware in very different ways. Recommendation models may exceed LLMs in parameter count but require orders of magnitude fewer FLOPs, while LLMs impose extreme pressure on memory bandwidth and latency during inference. Over the last decade, model size has grown much faster than memory capacity and bandwidth, turning AI from a chip problem into a full system‑design problem.

Inference is a dual-stage workload, with very different resource demands for each stage

Prefill processes many input tokens in parallel to build the KV cache and produces the first token. It is compute-bound and does not saturate HBM bandwidth, so moving prefill off HBM can cut cost with little effect on Time-To-First-Token (TTFT). Decode then emits tokens one by one, repeatedly reusing the KV cache. Here, bandwidth and the memory hierarchy dominate Inter-Token Latency (ITL) and throughput.

Prefill (prompt processing)

Processes the entire input prompt in parallel
Builds the KV cache for attention
Dominated by large matrix multiplications
Compute‑bound
Drives TTFT

Decode (token generation)

Generates output tokens one at a time (sequential)
Reuses and grows the KV cache on every step
Dominated by memory reads of model weights and KV state
Memory‑bandwidth‑bound
Drives ITL and overall throughput

This split in resource demands between the prefill and decode stages makes “one GPU for everything” inefficient: expensive bandwidth goes unused in prefill, and abundant tensor compute sits idle in decode. Prefill‑decode disaggregation, which runs the two stages on different resources, improves throughput and eases latency budgets.
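The compute-bound versus bandwidth-bound distinction can be illustrated with a rough roofline-style estimate. The sketch below is a simplification, not a vendor model: it assumes a dense model that spends about 2 FLOPs per parameter per token and reads each weight once per forward pass, and it ignores attention and KV traffic. The accelerator figures (`PEAK_FLOPS`, `MEM_BW`) and the prompt length are hypothetical placeholders.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Simplified dense-model estimate: ~2 FLOPs per parameter per token,
    with each parameter read once per pass. Attention and KV-cache
    traffic are ignored, so treat the result as a rough guide only.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Roofline classification: compare intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    return "compute" if intensity >= ridge else "memory-bandwidth"


# Hypothetical accelerator (placeholder values, not a real product).
PEAK_FLOPS = 1e15  # FLOP/s
MEM_BW = 4e12      # bytes/s

prefill = arithmetic_intensity(tokens_per_pass=4096)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one token per step, batch 1

print("prefill:", prefill, "FLOP/B ->", bound_by(prefill, PEAK_FLOPS, MEM_BW))
print("decode: ", decode, "FLOP/B ->", bound_by(decode, PEAK_FLOPS, MEM_BW))
```

With these placeholder numbers the ridge point is 250 FLOPs per byte, so the 4,096-token prefill pass lands on the compute-bound side and single-token decode lands on the memory-bandwidth-bound side. Batching many decode requests raises decode intensity, which is one reason real deployments tune batch size per stage.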

Choosing the right memory for the right stage of inference

GDDR for prefill, HBM for decode

Leading vendors are deploying these disaggregated architectures for inference applications. With its latest roadmap²,³, NVIDIA is evolving its memory strategy to address the distinct prefill and decode stages of LLM inference, moving beyond a “one-size-fits-all” HBM approach toward a heterogeneous, disaggregated architecture. While HBM remains the standard for top-tier training, GDDR7 and LPDDR are increasingly used to optimize inference costs and power efficiency. NVIDIA’s Rubin CPX⁴ is a purpose-built prefill accelerator that drops HBM and uses less expensive GDDR7, while Rubin with HBM handles decode.

Source: Nvidia³

Because prefill is compute‑bound, reducing memory bandwidth has only a modest impact on TTFT, while eliminating HBM dramatically improves cost efficiency. Decode, by contrast, remains on HBM‑equipped Blackwell/Rubin GPUs, where multi‑terabyte‑per‑second bandwidth is essential to sustain low inter‑token latency.
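To see why decode is so sensitive to memory bandwidth, consider a back-of-the-envelope floor on inter-token latency. The sketch below assumes each decode step streams all model weights and the full KV cache from memory once, at batch size 1, with compute fully overlapped. The function name, workload sizes, and bandwidth figures are illustrative assumptions rather than measurements of any product.

```python
def itl_floor_ms(weight_bytes: float, kv_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on inter-token latency in milliseconds.

    Assumes every decode step reads all weights plus the whole KV cache
    exactly once, and that memory bandwidth is the only limit.
    """
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s * 1e3


# Hypothetical workload: 70 GB of weights plus 10 GB of KV state.
WEIGHTS = 70e9
KV = 10e9

# Two hypothetical memory systems, for comparison only.
for label, bandwidth in [("8 TB/s memory", 8e12), ("1.6 TB/s memory", 1.6e12)]:
    floor = itl_floor_ms(WEIGHTS, KV, bandwidth)
    print(f"{label}: ITL floor ~ {floor:.1f} ms")
```

Under these assumptions the floor scales inversely with bandwidth: a fivefold drop in bandwidth means a fivefold rise in the best achievable ITL. Prefill does not pay the same penalty because it amortizes each weight read across many prompt tokens.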

At rack scale, CPX‑class prefill GPUs hand off KV caches to HBM‑rich decode GPUs over high‑speed NVLink fabrics, allowing each phase to run on hardware optimized for its dominant bottleneck.
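The size of the KV cache being handed off determines how much the fabric matters. A minimal sizing sketch follows, using the standard per-token estimate of one key and one value vector per layer per KV head. The model shape in `MODEL`, the context length, and the link bandwidth are hypothetical values chosen for illustration, not figures for any specific model or for NVLink.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes added per token: one K and one V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(context_tokens: int, **model) -> int:
    """Total KV-cache size for a single request of the given context length."""
    return context_tokens * kv_bytes_per_token(**model)


def handoff_seconds(cache_bytes: float, link_bytes_per_s: float) -> float:
    """Time to move the cache from prefill to decode, ignoring protocol overhead."""
    return cache_bytes / link_bytes_per_s


# Hypothetical model shape with grouped-query attention and 16-bit KV entries.
MODEL = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

cache = kv_cache_bytes(128 * 1024, **MODEL)  # 128Ki-token context
print(f"KV cache: {cache / 2**30:.1f} GiB")
print(f"Handoff at a hypothetical 900 GB/s link: {handoff_seconds(cache, 900e9) * 1e3:.1f} ms")
```

For this hypothetical shape, a 128Ki-token request carries 40 GiB of KV state, which takes roughly 48 ms to move at the assumed link rate. That transfer sits between prefill and the first decode step, so the handoff time feeds into TTFT.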

Other vendors, such as Qualcomm⁵, are taking an alternative approach, leveraging LPDDR for disaggregated inference to balance capacity, utilization, and cost, with architecture details yet to be revealed.

LPDDR (SOCAMM) for host / pooled memory and KV‑offload

On the CPU side, NVIDIA is pivoting toward LPDDR5X² in its Vera CPUs, which use SOCAMM (compression-attached LPDDR) modules of 128 GB each to build an LPDDR5X memory subsystem of up to 1.5 TB, delivering up to 1.2 TB/s of bandwidth at low power. SOCAMM combines LPDDR efficiency with serviceable modules, creating large, power-lean memory pools for KV cache offload and unified CPU-GPU memory via NVLink-C2C.

Software stack to optimize the hardware with orchestration and scheduling for various workloads

NVIDIA pairs the hardware with TensorRT‑LLM (paged/quantized KV, eviction), Dynamo, and SGLang/vLLM integrations for software scheduling and orchestration to achieve prefill/decode balance across tiered memories. Similarly, AMD pairs its hardware with its ROCm software stack.

The bottom line for inference: TCO

For data center operators, the shift towards heterogeneous disaggregated memory is not an option, but a necessity to achieve the lowest cost per served token. Inference efficiency today depends on matching memory technology to workload phase. HBM is the most expensive memory in the system, yet the prefill phase leaves much of its bandwidth unused. Moving prefill to GDDR‑based accelerators and reserving HBM for decode reduces the amount of premium memory required per request, lowering amortized cost per token without violating latency SLAs. LPDDR provides power‑efficient capacity for KV cache, embeddings, and system‑level memory pooling.
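The cost argument can be made concrete with a toy model. The sketch below is not a TCO calculator: it charges each phase for the accelerator time it consumes and divides by the output tokens produced. Every input (phase durations, relative cost rates, the assumed prefill slowdown on the cheaper device) is a hypothetical placeholder in arbitrary cost units.

```python
def cost_per_output_token(prefill_s: float, decode_s: float, output_tokens: int,
                          prefill_cost_per_s: float, decode_cost_per_s: float) -> float:
    """Amortized accelerator cost per generated token for one request."""
    total = prefill_s * prefill_cost_per_s + decode_s * decode_cost_per_s
    return total / output_tokens


# Hypothetical long-prompt request (placeholder values, arbitrary cost units).
HBM_COST_PER_S = 1.0    # HBM-based accelerator
GDDR_COST_PER_S = 0.5   # assumed cheaper GDDR-based prefill accelerator
OUTPUT_TOKENS = 300

# Monolithic: both phases run on the HBM accelerator.
monolithic = cost_per_output_token(4.0, 6.0, OUTPUT_TOKENS,
                                   HBM_COST_PER_S, HBM_COST_PER_S)

# Disaggregated: prefill assumed 20% slower on the cheaper device, decode unchanged.
disaggregated = cost_per_output_token(4.8, 6.0, OUTPUT_TOKENS,
                                      GDDR_COST_PER_S, HBM_COST_PER_S)

print(f"monolithic:    {monolithic:.4f} per token")
print(f"disaggregated: {disaggregated:.4f} per token")
```

In this toy setup the disaggregated configuration is cheaper per token even though prefill takes longer, because the slower phase runs on hardware that costs less per second. The size of the gap depends entirely on the assumed inputs, so real comparisons need measured phase times and actual hardware costs.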

Workloads change over time, and data center operators need to future-proof their infrastructure. Heterogeneous disaggregated memory systems, in conjunction with software orchestration and scheduling, provide tuning knobs that allow operators to rebalance prefill and decode resources without wholesale platform replacement.

References

Raj Uppala

Raj Uppala is the Sr. Director of Marketing at Rambus, where he oversees marketing and partnerships for the Silicon IP business unit. Prior to Rambus, he held several roles at Western Digital in product management, product marketing, and ecosystem partnerships for the Hard Disk Drive (HDD) product line and a Smart Video product line encompassing cameras, AI analytics, and a Video Management System delivered as a service. Uppala began his career designing memory and mixed-signal ICs, subsequently transitioning to marketing and product line management roles across several semiconductor companies. He holds an MBA from Cornell University and an MS in EE from Mississippi State University.
