Matching memory technology to the inference workload phase is necessary to achieve the lowest cost per served token.
AI inference is no longer a single workload that a single type of accelerator or memory can serve efficiently. From fast chat replies to 10M-token codebases, inference spans widely diverse workloads with very different demands on latency, bandwidth, capacity, and compute, as the figure below demonstrates [1].

Source: Meta [1]
The AI inference spectrum of workloads includes:
These workloads stress hardware in very different ways. Recommendation models may exceed LLMs in parameter count but require orders of magnitude fewer FLOPs, while LLMs impose extreme pressure on memory bandwidth and latency during inference. Over the last decade, model size has grown much faster than memory capacity and bandwidth, turning AI from a chip problem into a full system‑design problem.
Prefill processes many input tokens in parallel to build the KV cache and produce the first token. It is compute-bound and does not saturate HBM bandwidth, so moving prefill off HBM can cut cost with little effect on Time-To-First-Token (TTFT). Decode then emits tokens one at a time, rereading the KV cache at every step. Here memory bandwidth and the memory hierarchy dominate Inter-Token Latency (ITL) and throughput.
Prefill (prompt processing)
Decode (token generation)
This split in resource demands between the prefill and decode stages makes “one GPU for everything” inefficient: expensive bandwidth goes unused in prefill, and abundant tensor compute sits idle in decode. Prefill‑decode disaggregation, which runs the two stages on different resources, improves throughput and makes latency budgets easier to meet.
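A rough roofline estimate makes the split concrete. The sketch below is illustrative only: the peak compute and bandwidth figures are assumed round numbers, not the specifications of any product. It models only the cost of streaming weights for a dense transformer (roughly 2 FLOPs per parameter per token) and ignores attention and KV-cache traffic.

```python
# Illustrative roofline estimate. All hardware numbers are assumptions.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that weights are read
    from memory once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bw: float) -> str:
    """Classify a pass as compute- or bandwidth-bound against the ridge point."""
    ridge = peak_flops / mem_bw  # FLOPs/byte at which both limits are reached
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "bandwidth-bound"


PEAK_FLOPS = 1.0e15  # assumed: 1 PFLOP/s of dense compute
MEM_BW = 4.0e12      # assumed: 4 TB/s of memory bandwidth

# Prefill: 8192 prompt tokens share one read of the weights.
print("prefill:", bound(8192, PEAK_FLOPS, MEM_BW))
# Decode: one new token per pass at batch size 1.
print("decode: ", bound(1, PEAK_FLOPS, MEM_BW))
```

Under these assumptions, lowering `MEM_BW` to 1.5 TB/s still leaves the 8192-token prefill pass compute-bound, while single-token decode stays far below the ridge point at either bandwidth. Batching decode requests raises its intensity, which is one reason real deployments differ from this simplified picture.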
Leading vendors are deploying these disaggregated architectures for inference applications. With its latest roadmap [2,3], NVIDIA is evolving its memory strategy to address the distinct prefill and decode stages of LLM inference, moving beyond a “one-size-fits-all” HBM approach toward a heterogeneous, disaggregated architecture. While HBM remains the standard for top-tier training, GDDR7 and LPDDR are increasingly used to optimize inference costs and power efficiency. NVIDIA’s Rubin CPX [4] is a purpose-built prefill accelerator that drops HBM and uses less expensive GDDR7, while Rubin with HBM handles decode.


Source: NVIDIA [3]
Because prefill is compute‑bound, reducing memory bandwidth has only a modest impact on TTFT, while eliminating HBM dramatically improves cost efficiency. Decode, by contrast, remains on HBM‑equipped Blackwell/Rubin GPUs, where multi‑terabyte‑per‑second bandwidth is essential to sustain low inter‑token latency.
At rack scale, CPX‑class prefill GPUs hand off KV caches to HBM‑rich decode GPUs over high‑speed NVLink fabrics, allowing each phase to run on hardware optimized for its dominant bottleneck.
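The size of that handoff can be estimated from the model shape. The following is a minimal sketch, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model dimensions and the link bandwidth are hypothetical placeholders, not figures for any particular model or NVLink generation.

```python
# Back-of-the-envelope KV cache sizing. Model shape and link speed are assumptions.

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    The leading factor of 2 covers one key tensor plus one value tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(n_bytes: int, link_bytes_per_s: float) -> float:
    """Idealized transfer time: payload divided by link bandwidth, no protocol overhead."""
    return n_bytes / link_bytes_per_s


# Hypothetical model: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
size = kv_cache_bytes(seq_len=32_768, n_layers=80, n_kv_heads=8, head_dim=128)
LINK_BW = 100e9  # assumed: 100 GB/s effective link bandwidth

print(f"KV cache: {size / 2**30:.1f} GiB")
print(f"handoff:  {transfer_seconds(size, LINK_BW):.3f} s")
```

For this hypothetical shape, a 32K-token prompt produces a 10 GiB cache, which takes about 0.11 s to move at the assumed link speed. Because the handoff happens once per request, it adds to TTFT rather than to inter-token latency, which is why the bandwidth of the fabric between the prefill and decode pools matters.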
Other vendors, such as Qualcomm [5], are taking an alternative approach, leveraging LPDDR for disaggregated inference to balance capacity, utilization, and cost, with architecture details yet to be revealed.
On the CPU side, NVIDIA is pivoting toward LPDDR5X [2] in its Vera CPUs. These use SOCAMM (compression-attached LPDDR) modules of 128 GB each, for an LPDDR5X memory subsystem of up to 1.5 TB that delivers up to 1.2 TB/s of bandwidth at low power. SOCAMM combines LPDDR efficiency with serviceable modules, creating large, power-lean memory pools for KV cache offload and unified CPU-GPU memory via NVLink-C2C.
NVIDIA pairs the hardware with TensorRT‑LLM (paged/quantized KV, eviction), Dynamo, and SGLang/vLLM integrations for software scheduling and orchestration to achieve prefill/decode balance across tiered memories. Similarly, AMD pairs its hardware with its ROCm software stack.
For data center operators, the shift toward heterogeneous, disaggregated memory is not optional; it is necessary to achieve the lowest cost per served token. Inference efficiency today depends on matching memory technology to workload phase. HBM is the most expensive memory in the system, yet the prefill phase leaves much of its bandwidth unused. Moving prefill to GDDR‑based accelerators and reserving HBM for decode reduces the amount of premium memory required per request, lowering amortized cost per token without violating latency SLAs. LPDDR provides power‑efficient capacity for KV cache, embeddings, and system‑level memory pooling.
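The cost argument reduces to simple arithmetic: amortized hourly spend across all accelerator pools divided by tokens served per hour. The sketch below shows the calculation only. The `Pool` type, the pool sizes, the hourly costs, and the throughput are hypothetical placeholders in arbitrary currency units, not measured or vendor-quoted figures.

```python
# Minimal cost-per-token model. Every number here is a placeholder.
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    accelerators: int
    hourly_cost_each: float  # amortized hardware plus power, arbitrary currency units


def cost_per_million_tokens(pools: list[Pool], tokens_per_second: float) -> float:
    """Total hourly cost of all pools divided by tokens served per hour, scaled to 1M tokens."""
    hourly_cost = sum(p.accelerators * p.hourly_cost_each for p in pools)
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1e6


# Hypothetical disaggregated deployment: a cheaper prefill pool and an HBM decode pool.
deployment = [
    Pool("prefill (GDDR-class)", accelerators=4, hourly_cost_each=2.0),
    Pool("decode (HBM-class)", accelerators=8, hourly_cost_each=5.0),
]
print(f"{cost_per_million_tokens(deployment, tokens_per_second=20_000):.3f} per 1M tokens")
```

To compare designs, an operator would evaluate the same function for each candidate deployment using measured throughput at the target latency SLA, since a configuration that is cheaper per hour only helps if it still sustains the required token rate.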
Workloads change over time, and data center operators need to future-proof their infrastructure. Heterogeneous, disaggregated memory systems combined with software orchestration and scheduling provide tuning knobs, allowing operators to rebalance prefill and decode resources without wholesale platform replacement.
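As a simple illustration of such a tuning knob, the sketch below sizes the two pools from the traffic mix. It assumes steady-state load and fixed per-accelerator throughput. The function name, request rates, and throughput figures are hypothetical, and a production scheduler would also account for queueing, batching, and latency targets.

```python
# Sizing prefill and decode pools from the traffic mix. All inputs are assumptions.
import math


def pool_sizes(req_per_s: float, prompt_tokens: int, output_tokens: int,
               prefill_tok_per_s: float, decode_tok_per_s: float) -> tuple[int, int]:
    """Return (prefill accelerators, decode accelerators) needed at steady state.

    prefill_tok_per_s and decode_tok_per_s are per-accelerator throughputs.
    """
    prefill = math.ceil(req_per_s * prompt_tokens / prefill_tok_per_s)
    decode = math.ceil(req_per_s * output_tokens / decode_tok_per_s)
    return prefill, decode


# Chat-like traffic: short prompts.
print(pool_sizes(50, prompt_tokens=500, output_tokens=300,
                 prefill_tok_per_s=20_000, decode_tok_per_s=2_500))
# Long-context traffic: same request rate and output length, much longer prompts.
print(pool_sizes(50, prompt_tokens=8_000, output_tokens=300,
                 prefill_tok_per_s=20_000, decode_tok_per_s=2_500))
```

With these placeholder inputs, the chat-like mix needs 2 prefill and 6 decode accelerators, while the long-context mix needs 20 prefill and the same 6 decode accelerators. In a disaggregated system that shift can be absorbed by growing one pool rather than replacing the whole platform.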