AI Inference Needs A Mix-And-Match Memory Strategy

Matching memory technology to the inference workload phase is necessary to achieve the lowest cost per served token.


AI inference is no longer a single workload that can be served efficiently by a single type of accelerator or memory. From fast chat replies to 10M-token codebases, inference spans wildly diverse workloads with very different limits on latency, bandwidth, capacity, and compute, as the figure below demonstrates [1].

Source: Meta [1]

The AI inference spectrum of workloads includes:

  • Interactive LLMs (chat, copilots, agents) with strict latency targets
  • Long‑context reasoning (codebases, research, video) with massive KV (key-value) cache footprints
  • Ranking and recommendation models with enormous embedding tables
  • Batch and offline inference where throughput and cost dominate over latency

These workloads stress hardware in very different ways. Recommendation models may exceed LLMs in parameter count but require orders of magnitude fewer FLOPs, while LLMs impose extreme pressure on memory bandwidth and latency during inference. Over the last decade, model size has grown much faster than memory capacity and bandwidth, turning AI from a chip problem into a full system‑design problem.

Inference is a dual-stage workload, with very different resource demands for each stage

Prefill processes many input tokens in parallel to build the KV cache and produces the first token. Because it is compute-bound and doesn’t saturate HBM bandwidth, moving prefill off HBM can cut cost without hurting Time-To-First-Token (TTFT) much. Decode then emits tokens one by one, rereading the KV cache at every step. Here bandwidth and the memory hierarchy dominate Inter-Token Latency (ITL) and throughput.

Prefill (prompt processing)

  • Processes the entire input prompt in parallel
  • Builds the KV cache for attention
  • Dominated by large matrix multiplications
  • Compute‑bound
  • Drives TTFT

Decode (token generation)

  • Generates output tokens one at a time (sequential)
  • Reuses and grows the KV cache on every step
  • Dominated by memory reads of model weights and KV state
  • Memory‑bandwidth‑bound
  • Drives ITL and overall throughput

This split in resource demands between prefill and decode makes “one GPU for everything” inefficient: expensive bandwidth goes unused in prefill, and abundant tensor compute sits idle in decode. Prefill‑decode disaggregation, which runs the two stages on different resources, improves both throughput and latency.
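The compute-bound versus memory-bound contrast can be illustrated with a rough roofline-style check. This is a minimal sketch, not a performance model: it assumes about 2 FLOPs per parameter per token, counts only weight traffic (KV and activation reads are ignored), and the hardware figures `PEAK_FLOPS` and `MEM_BANDWIDTH` are hypothetical placeholders rather than the specifications of any real device.

```python
# Roofline-style check: is a forward pass compute-bound or memory-bound?
# All hardware numbers below are hypothetical placeholders.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that each weight is read
    from memory once per pass. KV-cache and activation traffic are ignored,
    so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def classify(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare intensity with the machine balance (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


PEAK_FLOPS = 1.0e15     # hypothetical: 1 PFLOP/s of tensor compute
MEM_BANDWIDTH = 4.0e12  # hypothetical: 4 TB/s of memory bandwidth

prefill_ai = arithmetic_intensity(tokens_per_pass=8192)  # whole prompt in one pass
decode_ai = arithmetic_intensity(tokens_per_pass=1)      # one token, batch size 1

print("prefill:", classify(prefill_ai, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", classify(decode_ai, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the prefill pass sits far above the machine balance and the single-token decode step far below it. Batching many decode requests together raises decode intensity, which is one reason real deployments depend on scheduling as well as hardware.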

Choosing the right memory for the right stage of inference

GDDR for prefill, HBM for decode

Leading vendors are deploying these disaggregated architectures for inference applications. With its latest roadmap [2,3], NVIDIA is evolving its memory strategy to address the distinct prefill and decode stages of LLM inference, moving beyond a “one-size-fits-all” HBM approach toward a heterogeneous, disaggregated architecture. While HBM remains the standard for top-tier training, GDDR7 and LPDDR are increasingly used to optimize inference costs and power efficiency. NVIDIA’s Rubin CPX [4] is a purpose-built prefill accelerator that drops HBM and uses less expensive GDDR7, while Rubin with HBM handles decode.

Source: NVIDIA [3]

Because prefill is compute‑bound, reducing memory bandwidth has only a modest impact on TTFT, while eliminating HBM dramatically improves cost efficiency. Decode, by contrast, remains on HBM‑equipped Blackwell/Rubin GPUs, where multi‑terabyte‑per‑second bandwidth is essential to sustain low inter‑token latency.
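A back-of-the-envelope bound shows why decode bandwidth matters so much. If every decode step must stream the model weights and the KV cache from memory, inter-token latency cannot be lower than the bytes read divided by the memory bandwidth. The sketch below assumes exactly that; the model size, KV size, and bandwidth values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Lower bound on inter-token latency (ITL) for a bandwidth-bound decode step.
# Model and hardware numbers are hypothetical placeholders.

def min_itl_ms(weight_bytes: float, kv_bytes: float, mem_bandwidth: float) -> float:
    """Time to stream weights plus KV cache once, in milliseconds.

    Assumes each decode step reads all weights and the full KV cache,
    with no overlap, caching, or batching effects.
    """
    return (weight_bytes + kv_bytes) / mem_bandwidth * 1e3


WEIGHT_BYTES = 70e9  # hypothetical: 70B parameters at 1 byte each
KV_BYTES = 10e9      # hypothetical: 10 GB of KV state for the active context

for bandwidth in (1e12, 4e12, 8e12):  # 1, 4, and 8 TB/s
    itl = min_itl_ms(WEIGHT_BYTES, KV_BYTES, bandwidth)
    print(f"{bandwidth / 1e12:.0f} TB/s -> at least {itl:.1f} ms per token")
```

With these placeholder values, quadrupling bandwidth from 1 TB/s to 4 TB/s cuts the floor on per-token latency by the same factor, which is the behavior the paragraph above describes.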

At rack scale, CPX‑class prefill GPUs hand off KV caches to HBM‑rich decode GPUs over high‑speed NVLink fabrics, allowing each phase to run on hardware optimized for its dominant bottleneck.

Other vendors, such as Qualcomm [5], are taking an alternative approach, leveraging LPDDR for disaggregated inference to balance capacity, utilization, and cost, with architecture details yet to be revealed.

LPDDR (SOCAMM) for host / pooled memory and KV‑offload

On the CPU side, NVIDIA is pivoting toward LPDDR5X [2] in its Vera CPUs, using SOCAMM (compression-attached LPDDR) modules of 128 GB each. The resulting LPDDR5X memory subsystem provides up to 1.5 TB of capacity and up to 1.2 TB/s of bandwidth at low power. SOCAMM combines LPDDR efficiency with serviceable modules, creating large, power-lean memory pools for KV cache offload and unified CPU-GPU memory via NVLink-C2C.
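To see why long contexts push KV caches toward large pooled memory, it helps to estimate the footprint directly. For standard attention, each token stores one key and one value vector per layer per KV head. The sketch below applies that formula to a hypothetical model configuration (80 layers, 8 KV heads, head dimension 128, 2-byte values); only the 1.5 TB pool size comes from the text above.

```python
# KV cache footprint estimate for standard (grouped-query) attention.
# The model configuration is a hypothetical example, not a specific product.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV state per token: one key and one value vector
    per layer per KV head (hence the factor of 2)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)
context_tokens = 1_000_000
total_bytes = per_token * context_tokens

POOL_BYTES = 1.5e12  # the 1.5 TB LPDDR5X pool cited in the text

print(f"KV per token:    {per_token / 1024:.0f} KiB")
print(f"KV for context:  {total_bytes / 1e9:.2f} GB")
print(f"Contexts in pool: {int(POOL_BYTES // total_bytes)}")
```

For this assumed configuration, a single 1M-token context needs roughly 328 GB of KV state, so only a handful of such contexts fit even in a 1.5 TB pool. Quantized KV or fewer KV heads would shrink the footprint proportionally.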

Software stack to optimize the hardware with orchestration and scheduling for various workloads

NVIDIA pairs the hardware with TensorRT‑LLM (paged/quantized KV, eviction), Dynamo, and SGLang/vLLM integrations for software scheduling and orchestration to achieve prefill/decode balance across tiered memories. Similarly, AMD pairs its hardware with its ROCm software stack.

The bottom line for inference: TCO

For data center operators, the shift towards heterogeneous disaggregated memory is not an option, but a necessity to achieve the lowest cost per served token. Inference efficiency today depends on matching memory technology to workload phase. HBM is the most expensive memory in the system, yet the prefill phase leaves much of its bandwidth unused. Moving prefill to GDDR‑based accelerators and reserving HBM for decode reduces the amount of premium memory required per request, lowering amortized cost per token without violating latency SLAs. LPDDR provides power‑efficient capacity for KV cache, embeddings, and system‑level memory pooling.
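The cost argument can be made concrete with a simple amortization formula: a request's cost is the accelerator time spent in each phase multiplied by the hourly cost of the hardware running that phase. The sketch below is illustrative only. Every throughput and price figure is a made-up placeholder, and it ignores batching, KV transfer overhead, and utilization, so the output demonstrates the arithmetic and not any measured saving.

```python
# Amortized cost per request with separate prefill and decode hardware.
# All throughput and price figures are made-up placeholders.

def request_cost(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, decode_tps: float,
                 prefill_cost_per_hour: float,
                 decode_cost_per_hour: float) -> float:
    """Cost of one request: seconds in each phase times that phase's
    hardware cost, converted from hourly to per-second pricing."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return (prefill_s * prefill_cost_per_hour
            + decode_s * decode_cost_per_hour) / 3600.0


# Same HBM-based device for both phases (hypothetical rates and prices).
homogeneous = request_cost(100_000, 1_000, 20_000, 100, 10.0, 10.0)

# Prefill on a cheaper, somewhat slower device; decode unchanged.
disaggregated = request_cost(100_000, 1_000, 16_000, 100, 4.0, 10.0)

print(f"homogeneous:   ${homogeneous / 1_000:.6f} per output token")
print(f"disaggregated: ${disaggregated / 1_000:.6f} per output token")
```

In this toy setup the prefill device is slower, so TTFT grows from 5 s to 6.25 s, yet the request is cheaper because the prefill seconds are billed at a lower rate. Whether that trade holds in practice depends on real prices, throughput, and the latency SLA.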

Workloads change over time, and data center operators need to future-proof their infrastructure. Heterogeneous disaggregated memory systems, in conjunction with software orchestration and scheduling, provide tuning knobs that allow operators to rebalance prefill and decode resources without wholesale platform replacement.


References

  1. Meta AI Hardware Summit Presentation
  2. Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer | NVIDIA Technical Blog
  3. NVIDIA Unveils Rubin CPX: A New Class of GPU Designed for Massive-Context Inference | NVIDIA Newsroom
  4. NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads | NVIDIA Technical Blog
  5. Qualcomm Unveils AI200 and AI250—Redefining Rack-Scale Data Center Inference Performance for the AI Era | Qualcomm

