Architecting High-Goodput LLM Serving Systems: A Comparative Analysis of Prefill and Decode Strategies
1.0 Introduction: Beyond Throughput to User-Centric Performance
As Large Language Model (LLM) services scale to handle thousands of concurrent users with increasingly stringent latency guarantees, the industry is undergoing a critical shift in performance evaluation. The focus is moving from maximizing raw throughput, measured in tokens per second, to optimizing "goodput"—the rate of requests successfully served within strict Service Level Objectives (SLOs). The central challenge in achieving high goodput lies in managing the inherent conflict between the two fundamental phases of LLM inference—prefill and decode—especially under the dynamic and unpredictable load of a multi-tenant environment.
This whitepaper provides a systematic analysis of architectural strategies for LLM serving, intended to guide systems architects through the complex trade-offs of performance, cost, and user experience. The document's core argument is that for multi-tenant, latency-sensitive services, architectures that explicitly manage the interference between prefill and decode are superior to traditional aggregated approaches. For large-scale deployments, this analysis strongly favors architectures that disaggregate these phases onto separate hardware resources.
To understand the advanced solutions that enable this new paradigm of performance, one must first grasp the fundamental computational differences between the prefill and decode phases, as this distinction is the root cause of the performance bottlenecks that modern serving systems must overcome.
2.0 The Foundational Conflict: Prefill vs. Decode in Aggregated Systems
The strategic importance of understanding the two distinct phases of LLM inference cannot be overstated. The unique computational profiles of the prefill (prompt processing) and decode (token generation) phases are the primary source of performance bottlenecks in multi-tenant serving systems. Comprehending this dichotomy is the critical first step toward designing effective, high-goodput architectures.
The two phases can be contrasted by their distinct computational characteristics:
| Characteristic | Prefill (Prompt Processing) | Decode (Autoregressive Generation) |
|---|---|---|
| Input | The entire user prompt, processed at once. | One new token at each step, plus a cached history of previous tokens. |
| Operation | A single, large forward pass over the full sequence. | Many sequential, small forward passes (one per generated token). |
| Computational Nature | Highly compute-bound, dominated by large matrix multiplications over long sequences. | Primarily memory-bandwidth-bound, with lower arithmetic intensity per step. |
| Parallelism Profile | Strong intra-sequence parallelism is achievable. | Parallelism is primarily across requests in a batch, not within the sequential generation steps. |
This fundamental difference is succinctly captured by the architects of the DistServe system:
"Prefill is very compute-bound, meaning a small batch of prefills or even a single long enough prefill will easily saturate GPU computation. On the other hand, decoding needs a much bigger batch size to hit the compute bound, and is more easily subject to the memory bandwidth limit of the GPU." (Zhong et al., 2024)
These computational phases map directly to user-facing performance metrics. Time to First Token (TTFT) captures the perceived responsiveness of the system and is dominated by the latency of the prefill phase. In contrast, Time Per Output Token (TPOT), also known as Time-Between-Tokens (TBT), measures the pace of the streaming response and is dictated by the efficiency of the decode phase.
In modern serving systems like vLLM that use continuous batching, a common "prefill-first" scheduling policy prioritizes new requests to minimize TTFT. However, this creates a severe conflict. When a new, compute-heavy prefill task is batched with ongoing decode tasks, the entire GPU operation is bottlenecked by the prefill. This effectively blocks or delays the much faster decode steps. As described by TNG Technology Consulting (2025), the end-user impact is a frustrating "pausing of the streamed token generation when other users submit long prompts."
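A toy timeline makes this interference concrete. The sketch below is not vLLM's scheduler; it is a minimal model in which each engine step runs either one blocking prefill or one batched decode step, with assumed per-step costs, so that a single long prompt arriving mid-stream visibly stalls every in-flight decode.

```python
# Minimal timeline model of a prefill-first scheduler on one GPU (not vLLM's actual logic).
# Assumed step costs: a batched decode step takes 20 ms; a long prompt's prefill takes 400 ms
# and, under a prefill-first policy, runs as one blocking step before decoding resumes.

DECODE_STEP_MS = 20.0
LONG_PREFILL_MS = 400.0

def token_gaps(prefill_arrives_at_step: int, total_steps: int = 10) -> list[float]:
    """Gaps (ms) between consecutive output tokens for an in-flight decode stream."""
    gaps, clock = [], 0.0
    for step in range(total_steps):
        if step == prefill_arrives_at_step:
            clock += LONG_PREFILL_MS        # new prompt preempts the batch: no tokens emitted
        clock += DECODE_STEP_MS
        gaps.append(clock - sum(gaps))      # time elapsed since the previous emitted token
    return gaps

gaps = token_gaps(prefill_arrives_at_step=4)
print("inter-token gaps (ms):", [round(g) for g in gaps])
# -> steady 20 ms gaps, with one ~420 ms pause when the long prefill is scheduled
```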
The architectural consequence of this interference is severe. While a prefill-first policy may increase raw throughput, it degrades goodput by causing frequent TBT SLO violations for in-flight requests. This degradation forces system operators into a reactive cycle of costly over-provisioning, allocating far more GPU hardware than necessary just to reclaim an acceptable level of goodput. This fundamental conflict has spurred the development of new techniques aimed at mitigating this interference, beginning with chunked prefill.
3.0 Mitigation via Interleaving: The Role of Chunked Prefill
Chunked prefill has emerged as an essential optimization pattern for mitigating interference within the monolithic resource envelope of a single-GPU (aggregated) architecture. It represents a crucial first step in making aggregated systems more efficient and responsive under concurrent load, though it does not eliminate the core conflict.
The core concept is to split a large input prompt into smaller, more manageable chunks. Instead of processing the entire prompt in one monolithic, blocking operation, the system processes it chunk by chunk. This creates opportunities for the scheduler to interleave faster decode steps between the processing of each prefill chunk.
- As TNG Technology Consulting (2025) explains, "there can be as many concurrent decode steps during prefill as there are prefill chunks… for small chunk sizes the user now experiences only a slowing-down of token generation instead of a complete pause."
- NVIDIA (2024) adds that this "prevents the prefill phase from becoming a bottleneck, enables more parallelization with decode phase tokens, and increases GPU utilization."
However, this approach introduces a critical trade-off centered on the chunk size.
- Large chunks: These reduce the overhead of multiple kernel launches, resulting in a lower TTFT for the request being prefilled. However, they create longer blocking periods, effectively starving ongoing decodes and degrading the Time-Between-Tokens (TBT) and overall output tokens per second (TPS) for other users.
- Small chunks: These create more opportunities to interleave decode steps, significantly improving TBT for other users and enhancing the interactive feel of the system. The trade-off is higher kernel launch overhead, which can reduce overall computational efficiency if chunks are too small.
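The trade-off can be made concrete with a small extension of the earlier toy timeline: the long prefill is split into chunks, and one batched decode step is interleaved after each chunk. The per-chunk cost below is an assumed linear model plus a fixed launch overhead, so the numbers only illustrate the shape of the trade-off, not any particular engine.

```python
# Toy model of chunked prefill: the 400 ms prefill from the earlier sketch is split into
# equal chunks, with one 20 ms batched decode step interleaved after each chunk.
# Assumed costs: chunk time scales linearly with chunk size, plus a fixed 2 ms launch overhead.

DECODE_STEP_MS = 20.0
PROMPT_TOKENS = 2048
PREFILL_MS_PER_TOKEN = 400.0 / PROMPT_TOKENS
LAUNCH_OVERHEAD_MS = 2.0

def chunked_prefill_stats(chunk_tokens: int) -> tuple[float, float]:
    """Return (TTFT of the new request, worst inter-token gap seen by other users), in ms."""
    n_chunks = -(-PROMPT_TOKENS // chunk_tokens)          # ceiling division
    chunk_ms = chunk_tokens * PREFILL_MS_PER_TOKEN + LAUNCH_OVERHEAD_MS
    ttft = n_chunks * (chunk_ms + DECODE_STEP_MS)         # decodes interleaved before the first token
    worst_gap = chunk_ms + DECODE_STEP_MS                 # an ongoing decode waits for at most one chunk
    return ttft, worst_gap

for chunk in (2048, 512, 128):
    ttft, gap = chunked_prefill_stats(chunk)
    print(f"chunk={chunk:4d}  TTFT ~{ttft:4.0f} ms  worst TBT for others ~{gap:3.0f} ms")
```

Smaller chunks shrink the worst-case pause for other users from roughly the full prefill time down to one chunk's processing time, at the cost of a higher TTFT and more launch overhead for the request being prefilled. In practice this knob is exposed by the serving engine; vLLM, for instance, exposes chunked prefill through engine arguments such as `enable_chunked_prefill` and a per-step token budget (`max_num_batched_tokens`), though exact names and defaults should be confirmed against the current documentation.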
The empirical impact of this technique is notable. In a real-world multi-tenant deployment on vLLM, TNG Technology Consulting observed that enabling chunked prefill "increased the total token throughput by +50%." This gain is achieved by more effectively parallelizing the compute-intensive prefill operations with the memory-bound decode operations, increasing overall GPU utilization.
Advanced implementations seek to optimize the chunk size trade-off automatically. The vLLM community has proposed hybrid chunked prefill, an adaptive mechanism that switches between chunked and non-chunked modes based on system load. This approach has demonstrated a 2–5% increase in total throughput and a 10–20% lower TTFT at low concurrency. Similarly, NVIDIA TensorRT-LLM features dynamic chunk sizing, which automatically selects an optimal chunk size based on GPU utilization, simplifying configuration and improving memory management.
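Such adaptive variants can be pictured as a small policy layer that picks the chunk size per scheduling step from current load. The sketch below is a hypothetical policy, not the logic of the vLLM PR or of TensorRT-LLM: it simply disables chunking when no decode streams are active (to recover TTFT) and shrinks chunks as decode concurrency grows (to protect TBT).

```python
# Hypothetical load-adaptive chunk sizing (illustrative only; not vLLM or TensorRT-LLM code).
# Policy: with no concurrent decodes, prefill the whole prompt at once for the best TTFT;
# as decode concurrency rises, shrink the chunk so each blocking period stays short.

def pick_chunk_tokens(prompt_tokens: int,
                      active_decode_streams: int,
                      max_chunk: int = 2048,
                      min_chunk: int = 128) -> int:
    if active_decode_streams == 0:
        return prompt_tokens                  # nobody to starve: unchunked prefill, lowest TTFT
    # Halve the chunk budget for each doubling of decode concurrency, clamped to [min_chunk, max_chunk].
    chunk = max_chunk // active_decode_streams
    return max(min_chunk, min(chunk, prompt_tokens))

for streams in (0, 1, 4, 16, 64):
    print(f"{streams:3d} active decode streams -> chunk of {pick_chunk_tokens(4096, streams)} tokens")
```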
Despite these benefits, chunked prefill is fundamentally a workaround constrained by resource colocation. Prefill and decode operations still compete for the same GPU compute and memory resources, leading to residual interference, especially under high load. The architects of DistServe conclude that while promising for maximizing raw throughput, chunked prefill is insufficient when a service must honor strict SLOs for both TTFT and TPOT simultaneously.
While chunked prefill optimizes the performance of the aggregated model, a more radical architectural change—physically separating the phases—offers a more complete solution to the interference problem.
4.0 A Paradigm Shift: Prefill-Decode Disaggregation
Prefill-decode disaggregation represents a system-level redesign that challenges the colocation assumption entirely. Instead of mitigating interference on shared hardware, this approach eliminates it by physically separating prefill and decode workloads onto distinct, specialized hardware pools.
The canonical example of this architecture is DistServe, a system built on this principle. The architecture consists of two primary components:
- Prefill workers: A pool of GPUs dedicated to processing the compute-heavy prefill phase. These workers can be configured with parallelism and batching strategies optimized specifically for prompt processing to meet tight TTFT SLOs.
- Decode workers: A separate pool of GPUs optimized for the memory-bandwidth-bound decode phase. These can be tuned for high-concurrency generation to ensure low and predictable TBT.
The core workflow involves a crucial handoff between these pools. After a prefill worker processes a prompt, it transfers the resulting KV cache—the model's internal state—to a decode worker, which then takes over the autoregressive generation process.
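At the level of control flow, the handoff can be sketched as two worker pools connected by a queue that carries the request together with its freshly computed KV cache. The code below is a schematic of the pattern under simplifying assumptions (in-process queues in place of the interconnect, a placeholder byte string in place of real attention state, no model), not DistServe's implementation; in a real deployment the transfer crosses NVLink or the network and each pool carries its own parallelism and batching configuration.

```python
# Schematic of prefill/decode disaggregation with an explicit KV-cache handoff.
# Simplifications: in-process queues stand in for the interconnect, and the "KV cache"
# is a placeholder bytes object rather than per-layer key/value tensors.
from dataclasses import dataclass
from queue import Queue

@dataclass
class PrefilledRequest:
    request_id: str
    kv_cache: bytes          # in practice: per-layer K/V tensors shipped over NVLink or RDMA
    first_token: str

def prefill_worker(prompts: Queue, handoff: Queue) -> None:
    """Compute-optimized pool: one large forward pass per prompt, then hand off the KV cache."""
    while not prompts.empty():
        request_id, prompt = prompts.get()
        kv_cache = f"<kv-cache for {request_id}>".encode()   # stand-in for the real tensors
        handoff.put(PrefilledRequest(request_id, kv_cache, first_token="The"))

def decode_worker(handoff: Queue) -> None:
    """Bandwidth-optimized pool: resumes autoregressive generation from the transferred cache."""
    while not handoff.empty():
        req = handoff.get()
        print(f"{req.request_id}: received {len(req.kv_cache)}-byte placeholder cache, "
              f"resuming decode at {req.first_token!r}")

prompts, handoff = Queue(), Queue()
prompts.put(("req-1", "Summarize the attached report ..."))
prompts.put(("req-2", "Hi!"))
prefill_worker(prompts, handoff)
decode_worker(handoff)
```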
This disaggregated design yields two primary advantages, as identified by Zhong et al. (2024):
- Elimination of Interference: By running on separate hardware, prefill load surges no longer slow down ongoing decodes. This makes both TTFT and TBT predictable, stable, and independently optimizable.
- Decoupled Resource Allocation: System operators gain the flexibility to tune the number of GPUs, tensor parallelism degree, and batching strategies for each phase independently. This allows for far more efficient resource allocation tailored to the distinct computational profiles of prefill and decode.
The empirical evidence supporting disaggregation is compelling. In a direct comparison against a state-of-the-art aggregated system (vLLM), DistServe demonstrated significant improvements in goodput and SLO adherence across various workloads.
| Workload | Performance Metric | Improvement of DistServe vs. vLLM |
|---|---|---|
| Chatbot | Goodput | 2.0×–3.41× higher |
| Code completion | Goodput | 3.2× higher |
| Code completion | SLO Tightness | Supports 1.5× tighter SLOs |
| Summarization | Goodput | 4.48× higher |
| Summarization | SLO Tightness | Up to 10.2× tighter SLOs possible |
This data underscores how severely interference in aggregated systems can degrade performance. In a direct comparison with chunked prefill, the DistServe authors conclude that when both TTFT and TPOT must be strictly honored, "disaggregation emerges as a better choice."
Of course, this powerful approach comes with inherent costs and trade-offs. The primary complexities include model duplication, as weights must be loaded on both prefill and decode workers; KV cache transfer overhead across the GPU interconnect; and increased routing complexity to manage the two separate worker pools. However, the performance data strongly suggests that for large-scale systems, these overheads are significantly outweighed by the gains from eliminating interference and costly over-provisioning.
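The KV-cache transfer cost can be bounded with simple arithmetic. The sketch below uses assumed model and link parameters (a 7B-class model with an FP16 cache over a 600 GB/s NVLink-class link) purely to show the shape of the calculation; measured figures should be taken from the DistServe paper itself.

```python
# Back-of-the-envelope KV-cache transfer cost for the prefill -> decode handoff.
# Assumed: 7B-class model (32 layers, 32 KV heads, head dim 128), FP16 cache,
# and a 600 GB/s NVLink-class link. Illustrative, not measured, numbers.

LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 32, 128, 2
LINK_BYTES_PER_S = 600e9

def kv_bytes(prompt_tokens: int) -> int:
    # 2x for keys and values, per layer, per head, per head-dim element
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * prompt_tokens

for tokens in (512, 2048, 8192):
    size = kv_bytes(tokens)
    transfer_ms = size / LINK_BYTES_PER_S * 1e3
    print(f"{tokens:5d}-token prompt: KV cache ~{size / 1e9:.2f} GB, "
          f"transfer ~{transfer_ms:.1f} ms over the assumed link")
```

Under these assumed numbers, even a multi-thousand-token prompt's cache moves in a few milliseconds, on the order of a single decode step, which is consistent with the conclusion above that the transfer overhead is outweighed by the interference it removes.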
While disaggregation offers a complete solution via physical separation, an alternative architectural model seeks to capture the benefits of isolation without the multi-GPU overhead of full disaggregation.
5.0 The Middle Ground: Intra-GPU Isolation via SM Multiplexing
As an advanced alternative to physical disaggregation, systems leveraging Streaming Multiprocessor (SM) multiplexing aim to achieve logical workload isolation on a single GPU. This approach, exemplified by DuetServe, offers a compelling compromise between the operational simplicity of aggregation and the performance predictability of disaggregation, making it a valuable pattern for resource-constrained environments.
The DuetServe design is based on adaptive GPU spatial multiplexing. By default, it operates as an aggregated system. However, when its attention-aware roofline model predicts that resource contention will cause a degradation in Time-Between-Tokens (TBT), it dynamically partitions the GPU's SMs, allocating a dedicated subset to prefill tasks and the remainder to decode tasks. This creates temporary, logical isolation that prevents the compute-heavy prefill operations from blocking the latency-sensitive decode operations.
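The decision logic can be illustrated with a simplified roofline-style check. The sketch below is not DuetServe's attention-aware model; it is a minimal stand-in that projects the next inter-token time if prefill and decode share all SMs and, if the projection violates the TBT SLO, reserves an assumed fraction of SMs for decode.

```python
# Simplified stand-in for a roofline-style partitioning decision (not DuetServe's actual model).
# Idea: project the next inter-token time if prefill and decode share all SMs; if the projection
# violates the TBT SLO, carve out a fixed share of SMs for decode so its latency stays bounded.

TBT_SLO_MS = 50.0
DECODE_STEP_MS_FULL_GPU = 20.0     # assumed decode step time with all SMs
PREFILL_MS_FULL_GPU = 400.0        # assumed cost of the pending prefill work with all SMs

def plan_partition(pending_prefill: bool, decode_sm_share: float = 0.4) -> tuple[str, float]:
    """Return (mode, projected worst-case TBT in ms)."""
    if not pending_prefill:
        return "aggregated", DECODE_STEP_MS_FULL_GPU
    shared_tbt = PREFILL_MS_FULL_GPU + DECODE_STEP_MS_FULL_GPU   # prefill blocks the shared batch
    if shared_tbt <= TBT_SLO_MS:
        return "aggregated", shared_tbt
    # With spatial partitioning, decode keeps `decode_sm_share` of the SMs and runs concurrently;
    # assume its step time scales inversely with its SM share (a crude, optimistic model).
    partitioned_tbt = DECODE_STEP_MS_FULL_GPU / decode_sm_share
    return "partitioned", partitioned_tbt

for pending in (False, True):
    mode, tbt = plan_partition(pending)
    print(f"prefill pending={pending!s:5s} -> {mode:11s} projected TBT ~{tbt:.0f} ms")
```

The inverse-scaling assumption is deliberately crude; the point is only that the partitioning decision can be driven by a projected-latency check rather than a static split.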
DuetServe reports that this technique can deliver up to 1.3× higher total throughput compared to state-of-the-art aggregated frameworks while maintaining low generation latency. This gain, while significant, is smaller than the multi-fold improvements seen with full cross-GPU disaggregation, as it is designed for scenarios where dedicating multiple GPUs per model is not feasible.
From a comparative perspective, DuetServe occupies a unique position. Relative to pure aggregation, it adds a dynamic layer of isolation that activates only when needed to protect latency SLOs. Relative to full disaggregation, it avoids the overhead of model duplication and KV cache transfers, but at the cost of providing only partial isolation, as some global GPU resources like memory bandwidth are still shared.
The emergence of both physical disaggregation (DistServe) and logical, intra-GPU isolation (DuetServe) signals an industry-wide convergence on a core principle: for high-performance LLM serving, phase isolation is essential. The choice is no longer whether isolation is needed, but how it should be implemented.
6.0 Architectural Decision Framework: A Comparative Analysis
The choice of a serving architecture depends on a project's specific constraints, including scale, budget, and latency requirements. The analysis of aggregated, chunked, disaggregated, and multiplexed strategies can be synthesized into a clear decision framework to guide architects in selecting the most appropriate design for their objectives. The following table provides a comprehensive comparison of these approaches.
| Strategy | Key Techniques & Characteristics | Advantages (Pros) | Limitations & Costs (Cons) |
|---|---|---|---|
| Aggregated (Baseline) | Prefill and decode tasks are co-located on the same GPU and processed using continuous batching with a prefill-first policy. | Simplicity of implementation and high computational efficiency per kernel launch. | Severe interference between prefill and decode, leading to poor TBT under load. Forces costly over-provisioning to meet competing SLOs. |
| Aggregated with Chunked Prefill | The prefill prompt is split into smaller chunks, allowing decode steps to be interleaved. May use dynamic or hybrid chunk sizing. | +50% throughput observed in production (TNG); improved TBT and GPU utilization; better handling of long contexts. | Residual interference persists; chunk size tuning is complex. A within-GPU workaround, not a fundamental solution. |
| Disaggregated (e.g., DistServe) | Prefill and decode workloads are physically separated onto distinct pools of GPUs, with KV caches transferred between them. | 2–4.48× higher goodput; supports up to 10.2× tighter SLOs; eliminates direct interference; makes TTFT and TBT predictable and independently optimizable. | Model duplication required; introduces KV cache transfer overhead; higher operational complexity for routing and scaling. |
| Single-GPU SM Multiplexing (e.g., DuetServe) | GPU Streaming Multiprocessors (SMs) are dynamically partitioned between prefill and decode tasks on a single GPU to create logical isolation. | Up to 1.3× higher throughput while respecting TBT SLOs; avoids multi-GPU overhead like model duplication and KV transfers. | Partial isolation only; shared global resources like memory bandwidth remain a contention point. Implementation is hardware-specific. |
Based on this comparative evidence, a set of clear, opinionated recommendations can be formulated to guide the design and deployment of modern, production-grade LLM serving systems.
7.0 Opinionated Recommendations for Production LLM Deployments
This section distills the findings of the preceding analysis into a concrete deployment philosophy, offering a set of actionable, evidence-based directives for practitioners architecting high-performance LLM infrastructure.
- Adopt Goodput as the Primary KPI — Prioritize the rate of requests served within latency SLOs (goodput) over raw tokens per second. Failing to do so means optimizing for a vanity metric while delivering a poor user experience, leading to SLO violations and potential revenue loss. Track TTFT and TBT distributions for different workload classes and test against realistic, mixed-workload scenarios; a measurement sketch follows this list.
- Architecturally Isolate Prefill and Decode Concerns — Regardless of the physical deployment model, treat prefill and decode as distinct services. This means implementing separate scheduling logic, batching policies, and resource priorities for each phase, even when they are co-located on the same hardware. Avoid simplistic "prefill-first" policies that invariably sacrifice decode performance.
- Enable Hybrid/Dynamic Chunked Prefill by Default on Aggregated Systems — For any single-GPU deployment or cost-constrained environment that cannot support full disaggregation, chunked prefill is the essential first optimization. The empirical evidence of a +50% throughput gain (TNG, 2025) and improved interactivity makes it a non-negotiable feature. Prefer adaptive or dynamic implementations to minimize manual tuning and maximize benefits across varying loads.
- Prioritize Disaggregation for Multi-GPU, Latency-Sensitive Deployments — Where scale and budget permit the allocation of multiple GPUs per model, a disaggregated architecture like DistServe must be the default choice. The demonstrated 2–4.48× goodput gains are not just a performance metric; they translate directly into capital efficiency, allowing the system to serve more users within SLO on the same hardware footprint, thereby delaying costly scaling events.
- Explore Intra-GPU Multiplexing When GPU Resources are Scarce — For environments where GPU count is a hard constraint (e.g., edge deployments or smaller clusters), a DuetServe-like approach of dynamic SM partitioning is a viable strategy. It offers a way to achieve the benefits of partial isolation—namely, protecting TBT latency—without requiring additional GPUs.
- Isolate Heterogeneous Workloads — Actively prevent workloads with vastly different characteristics from interfering with each other. A long-form summarization job with a massive prompt should not be allowed to degrade the interactive performance of a dozen concurrent chat sessions. Use dedicated worker pools, queue partitioning, or scheduling priorities to logically or physically separate these workloads.
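To ground the first recommendation, the sketch below computes goodput the way this document defines it: the rate of requests that meet both their TTFT and TBT SLOs, derived from per-token timestamps. The SLO thresholds and sample traces are assumptions chosen only for illustration.

```python
# Goodput as defined in this document: requests served within BOTH the TTFT and TBT SLOs.
# Timestamps are seconds from request arrival; the traces and SLO values are illustrative.

TTFT_SLO_S = 0.5
TBT_SLO_S = 0.1

def meets_slo(token_times: list[float]) -> bool:
    ttft = token_times[0]
    worst_tbt = max((b - a for a, b in zip(token_times, token_times[1:])), default=0.0)
    return ttft <= TTFT_SLO_S and worst_tbt <= TBT_SLO_S

def goodput(traces: list[list[float]], window_s: float) -> float:
    """Requests per second that met both SLOs over the measurement window."""
    return sum(meets_slo(t) for t in traces) / window_s

traces = [
    [0.3, 0.35, 0.40, 0.45],    # healthy stream
    [0.4, 0.46, 0.95, 1.00],    # TBT blown by a ~490 ms stall (e.g., a competing prefill)
    [0.9, 0.95, 1.00],          # TTFT blown
]
print(f"throughput: {len(traces) / 10.0:.2f} req/s, goodput: {goodput(traces, window_s=10.0):.2f} req/s")
```

The gap between the two printed numbers is precisely the over-counting that a raw tokens-per-second or requests-per-second metric hides.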
These practical steps reflect a broader evolution in the industry's understanding of how to build truly scalable, responsive, and cost-effective LLM infrastructure.
8.0 Conclusion: The Imperative of Phase-Aware Serving
The evolution of LLM serving has made one thing clear: managing the concurrency between the prefill and decode phases is no longer an implementation detail but a central design consideration that dictates system performance, cost, and user satisfaction. The analysis shows that while chunked prefill offers an essential mitigation strategy for aggregated systems, disaggregation provides the most effective and robust pattern for achieving high goodput. Meanwhile, intra-GPU SM multiplexing presents a promising compromise for resource-constrained environments.
As user expectations for fluid, interactive AI experiences rise and context lengths continue to grow, the architectural patterns that succeed will be those that explicitly acknowledge, manage, and isolate the prefill and decode phases. Moving toward these phase-aware architectures will become the standard for building the next generation of scalable, responsive, and cost-effective LLM infrastructure.
9.0 References
- Gao, L., Jiang, C., Entezari Zarch, H., Wong, D., & Annavaram, M. (2025, November 6). DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing. arXiv. https://arxiv.org/abs/2511.04791
- NVIDIA. (2024). Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill. NVIDIA Technical Blog. https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/
- TNG Technology Consulting GmbH. (2025, April 16). Prefill and Decode for Concurrent Requests – Optimizing LLM Performance. Hugging Face Blog. https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests
- vLLM Community. (2025, November 27). Requesting review for PR #26625 (Hybrid Chunked Prefill). vLLM Forums. https://discuss.vllm.ai/t/requesting-review-for-pr-26625-hybrid-chunked-prefill/2035
- Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., & Zhang, H. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf
- Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., & Zhang, H. (2024). Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation. Hao AI Lab @ UCSD Blog. https://hao-ai-lab.github.io/blogs/distserve/
- Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., & Zhang, H. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv. https://arxiv.org/abs/2401.09670