Supporting Input Sequences of over a Million Tokens: Observations and Insights from StreamingLLM
Supporting very long input sequences, or streaming inference, presents several significant challenges:
- Memory Demands: Caching the Key-Value (KV) pairs for every token requires substantial memory, and that requirement grows linearly with sequence length (see the rough estimate after this list).
- Computational Costs: Each new token must attend to every cached token, so the per-token cost of attention also grows with sequence length.
- Model Accuracy: Accuracy tends to decline when the input sequence surpasses the length for which the model was originally trained.
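
To make the memory point concrete, here is a minimal back-of-the-envelope sketch of KV-cache size. The model configuration below (a Llama-2-7B-like setup with 32 layers, 32 heads, head dimension 128, and an FP16 cache) is an assumption chosen purely for illustration, not a figure from the paper.

```python
# Rough KV-cache size estimate (illustrative assumptions: Llama-2-7B-like config, FP16 cache).
num_layers = 32        # transformer layers (assumed)
num_heads = 32         # attention heads per layer (assumed)
head_dim = 128         # dimension per attention head (assumed)
bytes_per_value = 2    # FP16 storage

# Each token stores one K vector and one V vector per head in every layer.
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value

for seq_len in (4_096, 1_000_000):
    total_gb = bytes_per_token * seq_len / 1024**3
    print(f"{seq_len:>9,} tokens -> ~{total_gb:,.1f} GB of KV cache")
```

Under these assumptions each token costs roughly 0.5 MB of cache, so a million-token context would need on the order of hundreds of gigabytes for the KV cache alone, which is why bounding the cache size is central to StreamingLLM.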
Suggested Solutions: StreamingLLM
The paper “Efficient Streaming Language Models with Attention Sinks” presents two simple solutions that enable streaming inference over very long input sequences.
- Rolling KV Cache with Attention Sinks (No Training Required):
This method keeps the Key-Value (KV) pairs of a few initial tokens, the attention sinks, in the attention computation alongside the most recent sliding-window tokens. The KV cache in…