Multi-Modal Vision Language Models: Architecture and Key Design Considerations
Since OpenAI’s demo showcasing GPT-4’s vision capabilities, interest in vision-language models (VLMs) has surged. In this blog, we will cover typical VLM architectures, training recipes, and the key design choices that affect model performance and compute efficiency.
How Do Vision-Language Models Work?
Vision-language models integrate visual and textual information to perform a variety of tasks. Here’s a simplified breakdown of their workflow (a minimal code sketch follows the list):
- Vision Encoder: Input images are processed by the vision encoder, which extracts visual features from them.
- Multimodal Connector: The extracted visual features are mapped (and optionally pooled) to the language model (LLM) input space, creating visual tokens.
- Concatenation with Text Embeddings: These visual tokens are concatenated (and potentially interleaved) with the input sequence of text embeddings.
- Processing by the Language Model: The concatenated sequence of visual and textual tokens is fed into the language model, which then predicts the output text tokens.
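
To make the data flow concrete, here is a minimal PyTorch-style sketch of the four steps above. The class and parameter names (`ToyVLM`, `vision_encoder`, `llm`, `pool_stride`) are illustrative rather than taken from any particular model, and the sketch assumes the language model accepts precomputed input embeddings (e.g. via an `inputs_embeds` argument, as Hugging Face decoder models do):

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Illustrative VLM: vision encoder -> connector -> language model."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096, pool_stride=2):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a ViT returning patch features
        self.llm = llm                                # a decoder-only language model
        # Multimodal connector: optional pooling plus a projection into the LLM input space.
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, images, text_embeds):
        # 1. Vision encoder: (batch, num_patches, vision_dim) visual features.
        vis_feats = self.vision_encoder(images)
        # 2. Connector: pool along the patch axis, then map to the LLM width,
        #    producing the visual tokens.
        vis_feats = self.pool(vis_feats.transpose(1, 2)).transpose(1, 2)
        visual_tokens = self.proj(vis_feats)          # (batch, num_visual_tokens, llm_dim)
        # 3. Concatenate visual tokens with the text embeddings
        #    (real models typically interleave them at image placeholder positions).
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        # 4. The language model predicts the output text tokens over the combined sequence.
        return self.llm(inputs_embeds=inputs_embeds)
```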

VLM Training Recipe
Similar to LLM training, training a VLM consists of two main stages: pre-training and instruction fine-tuning.
- Pre-Training
The main purpose of this stage is to align the visual feature space with the language model’s text feature space by learning the connector layers.
- Datasets: Pre-training typically uses datasets such as image-text pairs, interleaved image-text documents, and OCR annotations (OCR-IDL).
- Process: During pre-training, the multimodal connector layers are trained while the pre-trained vision encoder and LLM backbone remain frozen (see the sketch below).
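
A hedged sketch of how this freezing is commonly expressed in PyTorch, reusing the illustrative `ToyVLM` from earlier and assuming pre-trained `vision_encoder` and `llm` modules are already loaded; only the connector's projection layer receives gradient updates:

```python
# Hypothetical pre-training setup: the connector is trainable,
# while the vision encoder and LLM backbone stay frozen.
model = ToyVLM(vision_encoder, llm)

for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.llm.parameters():
    p.requires_grad = False

# Hand the optimizer only the connector parameters (the learning rate is illustrative).
optimizer = torch.optim.AdamW(model.proj.parameters(), lr=1e-4)
```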

