Supporting Long Input Sequence Length over a Million Tokens: Observations and Insights from StreamingLLM

Don Moon
Jun 13, 2024

Supporting very long input sequences or streaming inference presents several significant challenges:

  1. Memory Demands: Maintaining Key-Value (KV) pairs for every token requires substantial memory, and this footprint grows linearly with sequence length (a rough sizing sketch follows the figures below).
  2. Computational Costs: Computing attention over a large number of tokens is expensive, since dense attention scales quadratically with sequence length.
  3. Model Accuracy: Accuracy tends to decline when the input sequence surpasses the length for which the model was originally trained.
Figure: KV cache size starts to dominate model size at large sequence lengths (Llama-2 7B).
Figure: Accuracy of conventional inference (dense attention) drops once the input exceeds the pre-training length (higher perplexity (PPL) means lower accuracy).
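
To make the memory challenge concrete, here is a back-of-the-envelope estimate of KV cache size versus weight size. The hyperparameters (32 layers, 32 heads, head dimension 128, fp16 storage) are assumptions matching a Llama-2 7B-like model, not measurements from the figure above.

```python
# Rough KV cache sizing for a Llama-2 7B-like model (assumed hyperparameters).
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2,   # fp16
                   batch_size: int = 1) -> int:
    """Bytes needed to store K and V for every token at every layer."""
    # Factor of 2 covers the separate K and V tensors.
    return 2 * batch_size * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

model_weights_gb = 7e9 * 2 / 1e9  # ~14 GB of fp16 weights for a 7B model

for seq_len in (4_096, 32_768, 262_144, 1_048_576):
    cache_gb = kv_cache_bytes(seq_len) / 1e9
    print(f"{seq_len:>9,} tokens -> KV cache ~ {cache_gb:7.1f} GB "
          f"(weights ~ {model_weights_gb:.0f} GB)")
```

At roughly 0.5 MB per token, the cache passes the ~14 GB of model weights around 30K tokens and exceeds 500 GB at a million tokens, which is why naively caching every token does not scale.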

Suggested Solutions: StreamingLLM

The paper, “Efficient Streaming Language Models with Attention Sinks”, presents two simple solutions to enable streaming (or very large input sequence length) inference.

  1. Rolling KV Cache with Attention Sinks (No Training Required):

This method keeps the Key-Value (KV) pairs of a few initial tokens ("attention sinks") in the attention computation alongside the sliding-window tokens. The KV cache in StreamingLLM therefore holds these sink tokens plus the most recent window of tokens, evicting everything in between, and positions are assigned relative to the cache rather than to the original text.
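A minimal sketch of this eviction policy is shown below. It assumes 4 sink tokens and a rolling window, with per-token KV entries represented as simple placeholders; the class and parameter names are illustrative, not the paper's reference implementation.

```python
from collections import deque

class SinkKVCache:
    """Keep the first n_sink tokens' KV pairs forever plus a rolling recent window."""

    def __init__(self, n_sink: int = 4, window: int = 1020):
        self.n_sink = n_sink
        self.sinks = []                      # KV pairs of the first n_sink tokens
        self.recent = deque(maxlen=window)   # rolling window; oldest entries evicted

    def append(self, kv):
        """Add a new token's (key, value) pair."""
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)           # deque drops the oldest entry when full

    def context(self):
        """KV pairs the next attention step actually attends to: sinks + recent window."""
        return self.sinks + list(self.recent)

# Toy usage: 4 sinks plus a window of 8 over 20 tokens.
cache = SinkKVCache(n_sink=4, window=8)
for t in range(20):
    cache.append((f"k{t}", f"v{t}"))
print([k for k, _ in cache.context()])
# -> ['k0', 'k1', 'k2', 'k3', 'k12', 'k13', ..., 'k19']
```

Because the cache size stays bounded at n_sink + window entries, both memory use and per-step attention cost remain constant no matter how long the stream runs.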
