
From Heuristics to Principles: A Practical Model for LLM Inference

Grand Mesa F

Wed Apr 29 | 1:35pm

Abstract

Large Language Model (LLM) inference at scale is governed by a complex interplay between compute, memory bandwidth, batching, and system architecture. Despite rapid innovation—such as Prefill-Decode Disaggregation (PDD), chunked prefill scheduling, and KV-cache offloading—practical deployment remains largely heuristic-driven, with limited understanding of when and why each technique delivers benefit.
This talk presents a simple but powerful analytical framework for reasoning about LLM inference performance in modern multi-GPU and multi-node systems. We introduce two core metrics: incremental prefill compute cost and incremental KV-cache memory load cost, which together capture the fundamental trade-offs underlying inference efficiency. Using these building blocks, we derive closed-form models that explain the behavior of aggregated execution, PDD, and chunked prefill, both with and without KV-cache offloading.
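
As a back-of-the-envelope illustration of these two building blocks (a sketch under assumed numbers, not the talk's exact formulation), the snippet below estimates the compute cost of prefilling one additional prompt token and the memory-load cost of carrying one additional cached token through each decode step. The ModelConfig and GpuConfig values are placeholders, and the ~2 x params FLOPs-per-token and per-token KV-cache-size formulas are common approximations assumed here.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Placeholder dimensions loosely resembling a 70B-class GQA model (assumption).
    num_layers: int = 80
    num_kv_heads: int = 8
    head_dim: int = 128
    params: float = 70e9          # total parameter count
    bytes_per_elem: int = 2       # fp16/bf16 KV cache


@dataclass
class GpuConfig:
    # Placeholder hardware numbers, roughly a modern datacenter GPU (assumption).
    peak_flops: float = 1.0e15    # dense FLOP/s
    mem_bandwidth: float = 3.0e12 # HBM bytes/s


def incremental_prefill_compute_cost(m: ModelConfig, g: GpuConfig) -> float:
    """Compute time (s) to prefill ONE additional prompt token,
    using the common ~2 * params FLOPs-per-token approximation."""
    flops_per_token = 2.0 * m.params
    return flops_per_token / g.peak_flops


def incremental_kv_load_cost(m: ModelConfig, g: GpuConfig) -> float:
    """Memory time (s) per decode step contributed by ONE additional cached token:
    its K and V vectors are re-read from HBM at every subsequent step."""
    kv_bytes_per_token = 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_elem
    return kv_bytes_per_token / g.mem_bandwidth


if __name__ == "__main__":
    m, g = ModelConfig(), GpuConfig()
    print(f"prefill compute cost: {incremental_prefill_compute_cost(m, g) * 1e6:.1f} us/token")
    print(f"KV-cache load cost:   {incremental_kv_load_cost(m, g) * 1e9:.1f} ns/token/step")
```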
The analysis reveals several non-obvious insights: when PDD provides real gains (and when it does not), how to optimally partition GPUs between prefill and decode, and why chunked prefill can achieve up to 2x throughput improvements by overlapping compute- and memory-bound phases. We further show how prefill acceleration through KV-cache reuse translates into end-to-end throughput gains under continuous batching and how it composes with other architectural techniques.
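
One way to see the "up to 2x" claim is a simple bound comparison (a sketch under idealized assumptions, not the talk's derivation): if a compute-bound prefill phase of duration T_p and a memory-bound decode phase of duration T_d run back-to-back, the batch takes T_p + T_d; if chunked prefill lets the two phases overlap perfectly, the slower phase hides the faster one and the batch takes max(T_p, T_d), so the speedup (T_p + T_d) / max(T_p, T_d) peaks at 2x when the phases are balanced.

```python
def serial_batch_time(prefill_time: float, decode_time: float) -> float:
    """Aggregated execution: compute-bound prefill and memory-bound decode
    run back-to-back, so their times add."""
    return prefill_time + decode_time


def overlapped_batch_time(prefill_time: float, decode_time: float) -> float:
    """Idealized chunked prefill: prefill chunks are co-scheduled with decode
    steps, so compute and memory-bandwidth use overlap and the slower phase
    dominates. Real schedulers land between these two bounds."""
    return max(prefill_time, decode_time)


def overlap_speedup(prefill_time: float, decode_time: float) -> float:
    return serial_batch_time(prefill_time, decode_time) / overlapped_batch_time(
        prefill_time, decode_time
    )


if __name__ == "__main__":
    # Speedup peaks at 2x when the two phases are balanced and shrinks
    # as one phase dominates (illustrative timings, not measurements).
    for p, d in [(1.0, 1.0), (1.0, 0.5), (1.0, 0.1)]:
        print(f"prefill={p:.1f}s decode={d:.1f}s -> speedup {overlap_speedup(p, d):.2f}x")
```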
This framework provides actionable guidance for designing and operating high-performance, cost-efficient LLM inference systems, turning a complex optimization problem into a set of principled, predictable trade-offs.
