[vLLM — Prefix KV Caching] vLLM’s Automatic Prefix Caching vs ChunkAttention
The prefix KV caching mechanism in vLLM speeds up large language model inference by reusing previously computed key-value pairs from the attention layers, so requests that share a common prefix avoid recomputing it [2]. However, this optimization applies only to the prefill phase, leaving the decode phase unoptimized. In this blog, we explore the paper “ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition,” which extends prefix-aware KV cache reuse into the decode phase of LLM inference, addressing this limitation and improving overall efficiency.
vLLM Automatic Prefix Caching
PagedAttention introduces a KV cache management technique that partitions the KV cache of each request into discrete KV blocks, where each block holds the keys and values for a fixed number of tokens. These blocks are allocated on demand and stored in non-contiguous physical memory, which eliminates fragmentation.
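To make the idea concrete, here is a minimal Python sketch of the partitioning. Names like `BlockAllocator`, `build_block_table`, and `BLOCK_SIZE` are ours for illustration, not vLLM’s actual API:

```python
import math

BLOCK_SIZE = 4  # tokens per KV block; hypothetical here (vLLM's default is 16)

class BlockAllocator:
    """Hands out physical block IDs on demand from a free pool."""
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self) -> int:
        # Any free block will do: KV blocks need not be physically contiguous.
        return self.free_blocks.pop()

def build_block_table(num_tokens: int, allocator: BlockAllocator) -> list:
    """Map a request's logical KV blocks to scattered physical blocks."""
    num_logical_blocks = math.ceil(num_tokens / BLOCK_SIZE)
    return [allocator.allocate() for _ in range(num_logical_blocks)]

allocator = BlockAllocator(num_physical_blocks=64)
block_table = build_block_table(num_tokens=12, allocator=allocator)
print(block_table)  # three logical blocks mapped to physical IDs, e.g. [63, 62, 61]
```

The indirection through the block table is what lets physical memory be handed out one block at a time instead of reserving a contiguous region up front.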

Each KV block is uniquely identified by the combination of its own tokens and the prefix tokens preceding it. For instance, consider the sequence “A gentle breeze stirred the leaves as children laughed in the distance”:
- Block 1: “A gentle breeze stirred”
- Block 2: Prefix: “A gentle breeze stirred”, Block: “the leaves as children”
- Block 3: Prefix: “A gentle breeze stirred the leaves as children”, Block: “laughed in the distance”
```
        Block 1                  Block 2                  Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]

Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```
The mapping is established as `hash(prefix tokens + block tokens) <--> KV Block`. This approach adds an indirection layer to KV cache management by mapping logical blocks to their hash values and maintaining a global hash table of physical blocks. Shared prefix blocks across requests can then map to the same physical block, reducing redundant memory use.
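Below is a self-contained sketch of this hash-based indirection layer. `GlobalBlockTable` and `get_or_allocate` are hypothetical names; the real vLLM block manager additionally tracks reference counts and eviction, which we omit here:

```python
BLOCK_SIZE = 4  # tokens per block, matching the example above

def block_hash(prefix_tokens: tuple, block_tokens: tuple) -> int:
    # hash(prefix tokens + block tokens) uniquely identifies a block's content.
    return hash(prefix_tokens + block_tokens)

class GlobalBlockTable:
    """Maps content hashes to physical blocks so identical prefixes are shared."""
    def __init__(self):
        self.hash_to_block = {}  # content hash -> physical block ID
        self.next_block_id = 0

    def get_or_allocate(self, prefix_tokens, block_tokens) -> int:
        h = block_hash(tuple(prefix_tokens), tuple(block_tokens))
        if h in self.hash_to_block:        # cache hit: reuse an existing KV block
            return self.hash_to_block[h]
        self.hash_to_block[h] = self.next_block_id  # cache miss: allocate a new one
        self.next_block_id += 1
        return self.hash_to_block[h]

tokens = "A gentle breeze stirred the leaves as children laughed in the distance".split()
cache = GlobalBlockTable()
for request in range(2):  # two requests with an identical prompt
    ids = [cache.get_or_allocate(tokens[:i], tokens[i:i + BLOCK_SIZE])
           for i in range(0, len(tokens), BLOCK_SIZE)]
    print(f"request {request}: physical blocks {ids}")
# Both requests print the same IDs: shared prefix blocks map to shared memory.
```

Because the hash covers the full prefix, not just the block’s own tokens, two blocks with identical tokens but different histories still get distinct entries, which is what keeps the cached keys and values correct for each position.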