Guillaume THIBAULT (GuillaumeTHIBAULT)
Recent activity: reacted with 👍 to singhsidhukuldeep's post, about 1 month ago:
Good folks at @nvidia have just released NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results across vision-language tasks.
Here is how they did it:
1. Model Architecture Design:
- Developed three model architectures:
a) NVLM-D: Decoder-only architecture
b) NVLM-X: Cross-attention-based architecture
c) NVLM-H: Novel hybrid architecture
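To make the difference between the first two designs concrete, here is a minimal PyTorch sketch of the two fusion styles. The dimensions, module names, and gating scheme are illustrative assumptions, not the paper's exact implementation; NVLM-H combines elements of both paths.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
D_VISION, D_LLM = 3200, 8192

class DecoderOnlyFusion(nn.Module):
    """NVLM-D style: project image features into the LLM embedding space and
    splice them into the token sequence, so self-attention sees both modalities."""
    def __init__(self):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(D_VISION, D_LLM), nn.GELU(), nn.Linear(D_LLM, D_LLM)
        )

    def forward(self, text_embeds, image_feats):
        img_embeds = self.projector(image_feats)            # (B, N_img, D_LLM)
        return torch.cat([img_embeds, text_embeds], dim=1)  # one joint sequence

class CrossAttentionFusion(nn.Module):
    """NVLM-X style: text hidden states attend to image features through
    dedicated gated cross-attention layers; image tokens never enter the
    text sequence, which keeps the decoder's sequence length short."""
    def __init__(self, num_heads=8):
        super().__init__()
        self.kv_proj = nn.Linear(D_VISION, D_LLM)
        self.cross_attn = nn.MultiheadAttention(D_LLM, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as a no-op

    def forward(self, text_hidden, image_feats):
        kv = self.kv_proj(image_feats)
        attn_out, _ = self.cross_attn(text_hidden, kv, kv)
        return text_hidden + torch.tanh(self.gate) * attn_out
```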
2. Vision Encoder:
- Used InternViT-6B-448px-V1-5 as the vision encoder
- Implemented dynamic high-resolution (DHR) input handling
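For intuition, here is a simplified sketch of dynamic high-resolution input handling. The real scheme matches the image's aspect ratio against a set of candidate tile grids; this version just rounds to the nearest grid and is my assumption, not NVIDIA's code.

```python
from PIL import Image

TILE = 448  # InternViT-6B-448px-V1-5 consumes 448x448 inputs

def dynamic_tiles(image: Image.Image, max_tiles: int = 6):
    """Split a high-resolution image into a grid of 448px tiles (capped at
    max_tiles) plus a downscaled global thumbnail appended at the end."""
    w, h = image.size
    cols = min(max_tiles, max(1, round(w / TILE)))
    rows = min(max(1, max_tiles // cols), max(1, round(h / TILE)))
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
    thumbnail = image.resize((TILE, TILE))  # global view of the whole image
    return tiles + [thumbnail]
```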
3. Language Model:
- Used Qwen2-72B-Instruct as the base LLM
4. Training Data Curation:
- Carefully curated high-quality pretraining and supervised fine-tuning datasets
- Included diverse task-oriented datasets for various capabilities
5. Pretraining:
- Froze LLM and vision encoder
- Trained only modality-alignment modules (e.g., MLP projector, cross-attention layers)
- Used a large batch size of 2048
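In PyTorch terms, the stage-1 setup boils down to something like the sketch below. Module handles and hyperparameters are placeholders; the actual training stack behind an effective batch size of 2048 is of course far more involved.

```python
import torch

def configure_pretraining(vision_encoder, llm, projector, lr=1e-4):
    """Stage-1 modality alignment: freeze the vision encoder and the LLM,
    and give gradients only to the alignment module (MLP projector for
    NVLM-D, cross-attention layers for NVLM-X)."""
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in llm.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    # Only the trainable alignment parameters go to the optimizer.
    return torch.optim.AdamW(projector.parameters(), lr=lr)
```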
6. Supervised Fine-Tuning (SFT):
- Unfroze LLM while keeping the vision encoder frozen
- Trained on multimodal SFT datasets and high-quality text-only SFT data
- Implemented 1-D tile tagging for dynamic high-resolution inputs
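The 1-D tile tagging idea is simple enough to sketch: each dynamic high-resolution tile's image tokens are preceded by a plain-text tag telling the LLM which tile they came from. The tag strings and ordering below are assumptions for illustration, not necessarily the exact ones used in NVLM.

```python
def tag_tiles(tile_token_ids: list[list[int]], tokenizer) -> list[int]:
    """Interleave text tile tags with image-token blocks: <tile_1> ... <tile_k>
    for the high-resolution tiles, plus a separate tag for the global thumbnail
    (assumed here to be the last entry in tile_token_ids)."""
    sequence: list[int] = []
    for i, tile_ids in enumerate(tile_token_ids, start=1):
        tag = "<tile_global>" if i == len(tile_token_ids) else f"<tile_{i}>"
        sequence.extend(tokenizer.encode(tag, add_special_tokens=False))
        sequence.extend(tile_ids)
    return sequence
```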
7. Evaluation:
- Evaluated on multiple vision-language benchmarks
- Compared performance to leading proprietary and open-source models
8. Optimization:
- Iterated on model designs and training approaches
- Used smaller 34B models for faster experimentation before scaling to 72B
9. Now comes the best part... Open-Sourcing:
- Released model weights and full technical details to the research community
The paper provides fascinating insights into architecture design, training data curation, and achieving production-grade multimodality. A must-read for anyone working on multimodal AI!
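If you want to poke at the released checkpoints yourself, loading should look roughly like this. I'm assuming the repo id nvidia/NVLM-D-72B and the usual trust_remote_code recipe; check the model card for the authoritative instructions.

```python
from transformers import AutoModel, AutoTokenizer

repo = "nvidia/NVLM-D-72B"  # assumed repo id; see the model card
model = AutoModel.from_pretrained(repo, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```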