
[vLLM — Prefix KV Caching] vLLM’s Automatic Prefix Caching vs ChunkAttention

Don Moon
Published in Byte-Sized AI
7 min read · Dec 25, 2024

The prefix KV caching mechanism in vLLM enhances large language model inference by reusing previously computed key-value pairs from attention layers, allowing requests that share a common prefix to skip redundant computation over that prefix [2]. However, its application is limited to the prefill phase, leaving the decode phase unoptimized. In this blog, we explore the paper “ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition,” which extends KV caching to the decode phase of LLM inference, addressing this limitation and improving overall efficiency.

vLLM Automatic Prefix Caching

PagedAttention manages the KV cache by partitioning each request’s cache into discrete KV blocks, where each block holds the keys and values for a fixed number of tokens. Because blocks are allocated on demand and need not be contiguous in physical memory, the fragmentation that contiguous per-request allocation would cause is eliminated.
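
To make the idea concrete, here is a minimal Python sketch of on-demand, block-granular allocation. The block size, pool size, and class names are illustrative assumptions, not vLLM’s actual internals:

# Minimal sketch of PagedAttention-style block allocation.
# BLOCK_SIZE, pool size, and names are illustrative, not vLLM's API.

BLOCK_SIZE = 4  # tokens per KV block (chosen to match the example below)

class BlockAllocator:
    """Hands out fixed-size physical blocks on demand from a free pool."""

    def __init__(self, num_blocks: int):
        # Physical blocks can live anywhere in GPU memory; we model them
        # as integer IDs in a free list, so allocation never fragments.
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        # Raises IndexError when the pool is exhausted; a real engine
        # would evict or preempt instead.
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

def blocks_needed(num_tokens: int) -> int:
    # A request holding N tokens needs ceil(N / BLOCK_SIZE) blocks,
    # allocated lazily as the sequence grows.
    return -(-num_tokens // BLOCK_SIZE)

allocator = BlockAllocator(num_blocks=1024)
# A 12-token prompt occupies 3 non-contiguous physical blocks.
block_table = [allocator.allocate() for _ in range(blocks_needed(12))]
print(block_table)  # e.g. [1023, 1022, 1021]

Because every allocation is exactly one fixed-size block, freeing any block lets any future request reuse it, which is what removes external fragmentation.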

Each KV block is uniquely identified by the combination of its own tokens and all prefix tokens preceding it. For instance, with a block size of four tokens, the sequence “A gentle breeze stirred the leaves as children laughed in the distance” is split as:

  • Block 1: “A gentle breeze stirred”
  • Block 2: Prefix: “A gentle breeze stirred”, Block: “the leaves as children”
  • Block 3: Prefix: “A gentle breeze stirred the leaves as children”, Block: “laughed in the distance”
                  Block 1                   Block 2                   Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
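
The following snippet computes such prefix-aware block identities for the example sentence. It is purely illustrative: it hashes word tuples with Python’s built-in hash, whereas vLLM hashes token IDs:

# Illustrative only: hash each block together with its full prefix.
# vLLM hashes token IDs; words stand in for tokens here.

BLOCK = 4  # tokens per block

tokens = ("A gentle breeze stirred the leaves as children "
          "laughed in the distance").split()

block_hashes = []
for start in range(0, len(tokens), BLOCK):
    prefix = tuple(tokens[:start])            # all tokens before this block
    block = tuple(tokens[start:start + BLOCK])
    block_hashes.append(hash(prefix + block))

# Block 1 hashes only its own tokens; Block 3's hash covers all 12
# tokens, so its identity depends on the entire sequence so far.
print(block_hashes)

Note that Block 3’s identity covers all twelve tokens, so two requests can share a physical block only if everything up to and including that block matches.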

The mapping is established as hash(prefix tokens + block tokens) <--> KV Block. This approach adds an indirection layer in KV cache management by mapping logical blocks to their hash values and maintaining a global hash table for physical blocks. Shared prefix blocks across requests can then map to the same physical block, reducing redundant memory use.
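
Here is a hedged sketch of how such a global table lets two requests with a common prefix map to the same physical blocks. The helper names and the reference-counting scheme are assumptions for illustration, not vLLM’s implementation:

# Hypothetical global table: content hash -> physical block, plus
# refcounts. Names and refcounting scheme are illustrative only.

BLOCK = 4
hash_to_block = {}   # hash(prefix + block tokens) -> physical block id
ref_counts = {}      # physical block id -> number of requests using it
next_block_id = 0

def lookup_or_allocate(prefix_and_block: tuple) -> int:
    """Return a cached physical block on hash hit, else allocate one."""
    global next_block_id
    h = hash(prefix_and_block)
    if h in hash_to_block:               # cache hit: reuse, bump refcount
        block_id = hash_to_block[h]
        ref_counts[block_id] += 1
    else:                                # cache miss: allocate fresh block
        block_id = next_block_id
        next_block_id += 1
        hash_to_block[h] = block_id
        ref_counts[block_id] = 1
    return block_id

def map_request(tokens: list) -> list:
    table = []
    for start in range(0, len(tokens), BLOCK):
        key = tuple(tokens[:start + BLOCK])  # prefix + this block's tokens
        table.append(lookup_or_allocate(key))
    return table

shared = "A gentle breeze stirred the leaves as children".split()
req_a = map_request(shared + "laughed in the distance".split())
req_b = map_request(shared + "played in the park".split())
print(req_a, req_b)  # [0, 1, 2] [0, 1, 3]: the first two blocks are shared

Because each key includes the full prefix, a hash hit implies the cached keys and values are valid for the new request; a real implementation would typically also compare the actual tokens to guard against hash collisions.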

