Supporting Long Input Sequence Length over a Million Tokens: Observations and Insights from StreamingLLM

Don Moon
3 min read · Jun 13, 2024

Supporting long input sequences or streaming inference presents several significant challenges:

  1. Memory Demands: Maintaining Key-Value (KV) pairs for every token requires substantial memory, and the requirement escalates quickly as the sequence lengthens (see the back-of-the-envelope estimate below).
  2. Computational Costs: Calculating attention for a large number of tokens involves considerable computational resources.
  3. Model Accuracy: Accuracy tends to decline when the input sequence surpasses the length for which the model was originally trained.

Figure: KV cache size starts to dominate model size at long sequence lengths (Llama-2 7B).

Figure: Accuracy of conventional inference (dense attention) drops once the input exceeds the pre-training length (high perplexity (PPL) means low accuracy).
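
To make the memory pressure concrete, here is a rough sketch of the KV cache footprint for Llama-2 7B. It assumes FP16 storage and the standard configuration (32 layers, hidden size 4096); the constants and the exact thresholds printed are my own back-of-the-envelope assumptions, not figures from the paper.

```python
# Back-of-the-envelope KV cache size for Llama-2 7B (assumed FP16 weights and cache).
# Per-token KV cache = 2 (K and V) * num_layers * hidden_size * bytes_per_element.

NUM_LAYERS = 32        # Llama-2 7B
HIDDEN_SIZE = 4096     # 32 heads * head_dim 128
BYTES_PER_ELEM = 2     # FP16

def kv_cache_bytes(seq_len: int) -> int:
    """Total KV cache size in bytes for a single sequence of seq_len tokens."""
    return 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_ELEM * seq_len

model_bytes = 7e9 * BYTES_PER_ELEM  # ~14 GB of FP16 weights

for seq_len in (4_096, 32_768, 1_000_000):
    kv_gb = kv_cache_bytes(seq_len) / 1e9
    print(f"{seq_len:>9} tokens -> KV cache ~{kv_gb:6.1f} GB "
          f"(model weights ~{model_bytes / 1e9:.0f} GB)")
```

Under these assumptions the cache already matches the ~14 GB of FP16 weights at roughly 27K tokens, and a single one-million-token sequence would need about half a terabyte of KV cache.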

Suggested Solutions: StreamingLLM

The paper “Efficient Streaming Language Models with Attention Sinks” presents two simple techniques that enable streaming inference (or inference over very long input sequences).

  1. Rolling KV Cache with Attention Sinks (No Training Required):

This method keeps the KV pairs of a few initial tokens (the “attention sinks”) in the attention computation alongside the most recent sliding-window tokens. The KV cache in…
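Below is a minimal sketch of such a rolling cache, assuming a per-token append interface and PyTorch tensors. The class name, the defaults (4 sink tokens, a 1024-token window), and the helper methods are illustrative assumptions, not the paper's reference implementation.

```python
from collections import deque
import torch

class RollingKVCacheWithSinks:
    """Sketch of a rolling KV cache that permanently keeps the first
    `num_sinks` tokens (the attention sinks) plus the most recent
    `window_size` tokens, evicting everything in between."""

    def __init__(self, num_sinks: int = 4, window_size: int = 1024):
        self.num_sinks = num_sinks
        # KV pairs of the first few tokens: kept for the whole stream.
        self.sink_kv: list[tuple[torch.Tensor, torch.Tensor]] = []
        # KV pairs of recent tokens: the deque evicts the oldest automatically.
        self.window_kv: deque = deque(maxlen=window_size)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Add one token's K/V; once the sinks are filled, new tokens roll through the window."""
        if len(self.sink_kv) < self.num_sinks:
            self.sink_kv.append((k, v))
        else:
            self.window_kv.append((k, v))

    def get(self) -> tuple[torch.Tensor, torch.Tensor]:
        """Return stacked K and V used for attention: sinks first, then the sliding window."""
        entries = self.sink_kv + list(self.window_kv)
        keys = torch.stack([k for k, _ in entries])
        values = torch.stack([v for _, v in entries])
        return keys, values
```

One detail this sketch omits: StreamingLLM assigns positions based on where a token sits inside the cache rather than its position in the original text, so the positional encodings stay within the range the model was trained on.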
