
Towards Multimodal In-Context Learning for Vision & Language Models


Paper Link

1. The Core Problem: VLMs Struggle with Learning from Examples

  • Current Strength: Today’s Vision-Language Models (VLMs), like LLaVA, excel at zero-shot tasks – following instructions about a single image for tasks they haven’t been specifically trained on (e.g., “Caption this,” “What’s in the image?”).
  • The Weakness: They perform poorly on In-Context Learning (ICL). ICL is the powerful ability, common in Large Language Models (LLMs), to learn a new task on the fly just by seeing a few relevant examples (called “shots”) provided within the input prompt.
  • Why It’s a Gap: Even VLMs trained on vast amounts of interleaved image-text data often fail at ICL, especially when the task requires grasping a subtle pattern or concept shared across the examples (such as a specific style, object type, or attribute). This is likely because they were never explicitly trained to learn from in-context examples.
  • Importance: Bridging this gap is crucial for making VLMs more flexible, user-friendly, and adaptable to novel tasks without needing full retraining.

2. Research Goal: Make VLMs Effective In-Context Learners

  • To develop a straightforward yet effective method to significantly boost the ICL capabilities of existing VLMs.
  • Crucially, to achieve this without degrading their pre-existing zero-shot instruction-following abilities.

3. The Proposed Solution: Semantically Coherent, Multi-Turn ICL Tuning

  • Foundation: The method builds on the well-established LLaVA-1.5 13B architecture (sketched in code after this list):
    • Vision Encoder (E): Frozen CLIP ViT-L/14 (extracts image features).
    • Projector (P): Trainable MLP (maps vision features to language space).
    • LLM Decoder (D): Trainable Vicuna-1.5-13B (processes text + vision tokens, generates response).
  • Focus: The innovation lies in the fine-tuning stage and the specific training data strategy, not in changing the model architecture. Fine-tuning updates the Projector (P) and the LLM Decoder (D).
  • Key Technique: Multi-Turn ICL Conversations
    • Structure: Training data is formatted as conversations. Each conversation consists of multiple “turns” (Human prompt + GPT response), where each turn includes an image and represents one ICL demonstration shot.
      
      # Example ICL Training Conversation (2 shots + 1 query)
      Human: [Text defining Shot 1] <image1>  GPT: [Answer for Shot 1]
      Human: [Text defining Shot 2] <image2>  GPT: [Answer for Shot 2]
      Human: [Text defining Query]  <image3>  GPT: ??? (Model learns to predict this)
      
    • Semantic Coherence (Vital!): Within a single training conversation, all the shots must be semantically related. They should share a common underlying concept or task goal (e.g., all shots ask about color, all involve counting objects, all require captioning in a specific style, all show examples of discriminating between two bird species). This teaches the model what pattern to look for in the examples.
    • “Any-Shot” Training: The standard next-token-prediction (causal LM) objective, combined with this multi-turn format, naturally trains the model for varying numbers of shots simultaneously (0-shot for the first turn’s prediction, 1-shot for the second, and so on), making training efficient (see the label-masking sketch after this list).
  • The Optimal Training Data Mix:
    • ICL Data Component: Generated using specific partitions from SEED-Bench-2 (Tasks 1-4, 90% of Task 5) and VL-Checklist (70% split) to ensure semantic coherence. The best performing mix (Mix ID 5 from Table 3) had roughly these proportions for the ICL examples:
      • By Semantic Concept: Attributes (~45%), Categories (~36%), Relations (~15%), Instances (~3%).
      • By Task Format: Multiple Choice (~42%), Open QA (~40%), Captioning (~18%).
    • Replay Data Component: The entire original LLaVA visual instruction tuning dataset (mostly single-image, zero-shot examples) is added to the ICL data.
      • Purpose: Prevents catastrophic forgetting of base skills and empirically boosts ICL performance.
    • Final Training Set = ICL Data Component + Replay Data Component (a sampling sketch follows below).
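
For readers who want to see the data flow end to end, below is a minimal PyTorch sketch of the LLaVA-1.5-style wiring described above (frozen encoder, trainable projector, trainable decoder). The class name, the default dimensions, and the Hugging Face-style `inputs_embeds`/`labels` interface are illustrative assumptions, not the paper’s released code.

```python
import torch
import torch.nn as nn


class LLaVAStyleVLM(nn.Module):
    """Illustrative wiring of the components above: frozen E, trainable P and D."""

    def __init__(self, vision_encoder, llm_decoder, vision_dim=1024, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder          # E: frozen CLIP ViT-L/14
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        self.projector = nn.Sequential(               # P: trainable two-layer MLP
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm_decoder = llm_decoder                # D: trainable Vicuna-1.5-13B

    def forward(self, images, text_embeds, labels=None):
        # Extract image features with the frozen encoder, map them into the
        # LLM embedding space, and feed them to the decoder as ordinary tokens.
        with torch.no_grad():
            vision_feats = self.vision_encoder(images)    # (B, n_patches, vision_dim)
        vision_tokens = self.projector(vision_feats)      # (B, n_patches, llm_dim)
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)
        # labels should be padded with -100 over the vision-token positions so
        # that only text positions contribute to the causal-LM loss.
        return self.llm_decoder(inputs_embeds=inputs, labels=labels)
```

Only the projector and the decoder receive gradients, which matches the fine-tuning setup described above.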
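
To make the multi-turn format and the “any-shot” objective concrete, here is a short Python sketch. The JSON field names follow the common LLaVA conversation convention, and the `mask_labels` helper is hypothetical; the paper’s exact data schema and training code may differ.

```python
import json

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

# One semantically coherent ICL conversation: every shot probes the same
# concept (here, object color), so the model can learn what pattern the
# in-context examples share.
shots = [
    {"image": "img_001.jpg", "question": "What color is the car?", "answer": "Red"},
    {"image": "img_002.jpg", "question": "What color is the kite?", "answer": "Blue"},
    {"image": "img_003.jpg", "question": "What color is the house?", "answer": "Yellow"},
]

conversation = {
    "id": "icl_color_0001",
    "images": [s["image"] for s in shots],
    "conversations": [],
}
for s in shots:
    conversation["conversations"].append({"from": "human", "value": f"<image>\n{s['question']}"})
    conversation["conversations"].append({"from": "gpt", "value": s["answer"]})

print(json.dumps(conversation, indent=2))


def mask_labels(input_ids, gpt_spans):
    """Supervise only the GPT turns under the standard causal-LM objective.

    Because every GPT answer is supervised, one conversation trains the model
    at several shot counts at once: the first answer is a 0-shot prediction,
    the second a 1-shot prediction, and so on ("any-shot" training).
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in gpt_spans:  # token index ranges covering each GPT answer
        labels[start:end] = input_ids[start:end]
    return labels
```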
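
The reported data mix can also be read as sampling weights. The sketch below is a hypothetical way to assemble the ICL component and append the replay set; treating semantic concept and task format as independent draws is an assumption, since only marginal percentages are listed above.

```python
import random

rng = random.Random(0)

# Approximate proportions of the best-performing ICL mix (Mix ID 5), as listed above.
CONCEPT_WEIGHTS = {"attributes": 0.45, "categories": 0.36, "relations": 0.15, "instances": 0.03}
FORMAT_WEIGHTS = {"multiple_choice": 0.42, "open_qa": 0.40, "captioning": 0.18}


def sample_icl_component(pools, n_conversations):
    """pools maps (concept, task_format) -> a non-empty list of ICL conversations."""
    concepts, c_weights = zip(*CONCEPT_WEIGHTS.items())
    formats, f_weights = zip(*FORMAT_WEIGHTS.items())
    sampled = []
    for _ in range(n_conversations):
        concept = rng.choices(concepts, weights=c_weights)[0]
        fmt = rng.choices(formats, weights=f_weights)[0]
        sampled.append(rng.choice(pools[(concept, fmt)]))
    return sampled


def build_training_set(pools, replay_data, n_icl):
    # Final training set = sampled ICL component + the full LLaVA instruction-tuning replay set.
    return sample_icl_component(pools, n_icl) + list(replay_data)
```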

4. Testing the Approach (Evaluation Settings)

  • Baselines: Compared against LLaVA-1.5, LLaVA-1.6, IDEFICS 9B, OpenFlamingo 9B, EMU2 37B.
  • Key Evaluation Areas:
    • Novel Few-Shot Recognition ICL Tasks: Standard classification datasets (Dogs, CUB, Food, Cars, Flowers) were reformatted into 2-way, 1-shot multiple-choice (MC) ICL episodes to test generalization to unseen recognition tasks (see the episode-construction sketch after this list).
    • Diverse ICL Benchmarks: Held-out sections of SEED-Bench-2 (Instance Counting, unseen tasks 6-8, native ICL Task 23) and VL-Checklist (MC, QA, Cap formats on held-out 30%).
    • Base Skill Preservation: Performance measured on the comprehensive MME benchmark.
  • Metrics: Primarily accuracy (exact match), except for SEED Task 23, which uses perplexity.
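
To make the few-shot recognition protocol concrete, here is a hypothetical sketch of turning a classification dataset into 2-way, 1-shot multiple-choice ICL episodes, together with the exact-match metric. Prompt wording, option formatting, and function names are illustrative, not the paper’s exact templates.

```python
import random


def make_recognition_episode(dataset, rng=random):
    """Build one 2-way, 1-shot multiple-choice ICL episode.

    dataset maps a class name to a list of image paths. The episode holds one
    demonstration shot per class, followed by a query from one of the classes.
    """
    class_a, class_b = rng.sample(sorted(dataset), 2)
    prompt = f"Which class is shown in the image? (A) {class_a} (B) {class_b} <image>"
    shot_a = (prompt, rng.choice(dataset[class_a]), f"(A) {class_a}")
    shot_b = (prompt, rng.choice(dataset[class_b]), f"(B) {class_b}")
    query_cls = rng.choice([class_a, class_b])
    used = {shot_a[1], shot_b[1]}
    query_img = rng.choice([im for im in dataset[query_cls] if im not in used])
    answer = f"(A) {class_a}" if query_cls == class_a else f"(B) {class_b}"
    return [shot_a, shot_b, (prompt, query_img, answer)]


def exact_match_accuracy(predictions, references):
    """Exact-match accuracy, the primary evaluation metric mentioned above."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / max(len(references), 1)
```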

5. Key Findings: The Method Works Well

  • Massive ICL Boost: The proposed model (“Ours 13B”) significantly outperformed all baselines across the board on ICL tasks.
    • Average improvement: +11.3% over the strongest baseline (LLaVA-1.6).
    • Few-Shot Recognition ICL: +12.8% average improvement.
    • SEED Task 23 (Coherent Captioning): +4.16% improvement, validating the semantic coherence focus.
  • Semantic Coherence is Key: Explicit training on coherent examples proved more effective than just pre-training on generic interleaved data.
  • Good Generalization: The learned ICL skill transferred effectively to unseen task formats (like few-shot recognition) and semantic concepts.
  • Original Skills Intact: The model retained its base zero-shot abilities, confirmed by strong scores on the MME benchmark (Table 4), thanks to the inclusion of replay data.
  • Scales with Data: Performance consistently increased with more ICL training data (Figure 3), indicating potential for further gains.
  • Uses Context Examples: The model demonstrably leverages the provided shots at inference time; performance increased when going from 0 to 1 to 2 shots (Table 5).

6. Limitations and Future Directions

  • Context Window Limit: Vicuna-1.5-13B’s 2K-token context window restricts the practical number of image-based ICL shots to around three. Models with larger context windows would allow learning from more examples.
  • Further Optimization: Potential exists to refine the data mixing strategies, explore additional ICL task types, and investigate more nuanced ways to define and leverage semantic coherence between shots.

7. Concluding Takeaway

This research demonstrates that explicitly fine-tuning VLMs using multi-turn conversations that contain semantically coherent ICL examples (shots), combined with replaying base instruction data, is a simple yet highly effective strategy. It significantly enhances the crucial In-Context Learning capability, making VLMs more adaptable and powerful for few-shot tasks, without compromising their fundamental abilities. The careful design of the training data, particularly ensuring semantic coherence, is paramount.

This post is licensed under CC BY 4.0 by the author.