Supporting Input Sequences of over a Million Tokens: Observations and Insights from StreamingLLM
Supporting very long input sequences, or streaming inference, presents several significant challenges:
- Memory Demands: Caching the Key-Value (KV) pairs for every token requires substantial memory, and that requirement grows linearly with sequence length (see the rough estimate after this list).
- Computational Costs: Each new token must attend to every cached token, so the per-token cost of attention also grows with sequence length.
- Model Accuracy: Accuracy tends to decline when the input sequence surpasses the length for which the model was originally trained.
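
To make the memory point concrete, here is a minimal back-of-the-envelope sketch of KV-cache size. The model configuration below (a Llama-2-7B-like setup with 32 layers, 32 heads, head dimension 128, and an FP16 cache) is an assumption chosen purely for illustration, not a figure from the paper.

```python
# Rough KV-cache size estimate (illustrative assumptions: Llama-2-7B-like config, FP16 cache).
num_layers = 32        # transformer layers (assumed)
num_heads = 32         # attention heads per layer (assumed)
head_dim = 128         # dimension per attention head (assumed)
bytes_per_value = 2    # FP16 storage

# Each token stores one K vector and one V vector per head in every layer.
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value

for seq_len in (4_096, 1_000_000):
    total_gb = bytes_per_token * seq_len / 1024**3
    print(f"{seq_len:>9,} tokens -> ~{total_gb:,.1f} GB of KV cache")
```

Under these assumptions each token costs roughly 0.5 MB of cache, so a million-token context would need on the order of hundreds of gigabytes for the KV cache alone, which is why bounding the cache size is central to StreamingLLM.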
Suggested Solutions: StreamingLLM
The paper “Efficient Streaming Language Models with Attention Sinks” presents two simple solutions that enable streaming inference over very long input sequences.
- Rolling KV Cache with Attention Sinks (No Training Required):
This method keeps the Key-Value (KV) pairs of a few initial tokens, the attention sinks, in the attention computation alongside the most recent sliding-window tokens. The KV cache in…