Systematic Analysis of CPU-Induced Slowdowns in Multi-GPU LLM Inference (Georgia Tech)


A new technical paper, "Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference," was published by researchers at the Georgia Institute of Technology. Abstract: "Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and servin...

Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers


Harini Muthukrishnan (U of Michigan); David Nellans, Daniel Lustig (NVIDIA); Jeffrey A. Fessler, Thomas Wenisch (U of Michigan). Abstract: "Despite continuing research into inter-GPU communication mechanisms, extracting performance from multi-GPU systems remains a significant challenge. Inter-GPU communication via bulk DMA-based transfers exposes data transfer latency on the GPU’s critical...