Guillaume THIBAULT (GuillaumeTHIBAULT)
Recent activity: reacted with 👍 to singhsidhukuldeep's post, about 1 month ago:
Good folks at @nvidia have just released NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results across vision-language tasks.
Here is how they did it:
1. Model Architecture Design:
- Developed three model architectures:
a) NVLM-D: Decoder-only architecture
b) NVLM-X: Cross-attention-based architecture
c) NVLM-H: Novel hybrid architecture
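To make the difference between the first two designs concrete, here is a minimal PyTorch sketch of the two fusion styles. The dimensions, module names, and gating scheme are illustrative assumptions, not the paper's exact implementation; NVLM-H combines elements of both paths.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
D_VISION, D_LLM = 3200, 8192

class DecoderOnlyFusion(nn.Module):
    """NVLM-D style: project image features into the LLM embedding space and
    splice them into the token sequence, so self-attention sees both modalities."""
    def __init__(self):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(D_VISION, D_LLM), nn.GELU(), nn.Linear(D_LLM, D_LLM)
        )

    def forward(self, text_embeds, image_feats):
        img_embeds = self.projector(image_feats)            # (B, N_img, D_LLM)
        return torch.cat([img_embeds, text_embeds], dim=1)  # one joint sequence

class CrossAttentionFusion(nn.Module):
    """NVLM-X style: text hidden states attend to image features through
    dedicated gated cross-attention layers; image tokens never enter the
    text sequence, which keeps the decoder's sequence length short."""
    def __init__(self, num_heads=8):
        super().__init__()
        self.kv_proj = nn.Linear(D_VISION, D_LLM)
        self.cross_attn = nn.MultiheadAttention(D_LLM, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as a no-op

    def forward(self, text_hidden, image_feats):
        kv = self.kv_proj(image_feats)
        attn_out, _ = self.cross_attn(text_hidden, kv, kv)
        return text_hidden + torch.tanh(self.gate) * attn_out
```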
2. Vision Encoder:
- Used InternViT-6B-448px-V1-5 as the vision encoder
- Implemented dynamic high-resolution (DHR) input handling
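For intuition, here is a simplified sketch of dynamic high-resolution input handling. The real scheme matches the image's aspect ratio against a set of candidate tile grids; this version just rounds to the nearest grid and is my assumption, not NVIDIA's code.

```python
from PIL import Image

TILE = 448  # InternViT-6B-448px-V1-5 consumes 448x448 inputs

def dynamic_tiles(image: Image.Image, max_tiles: int = 6):
    """Split a high-resolution image into a grid of 448px tiles (capped at
    max_tiles) plus a downscaled global thumbnail appended at the end."""
    w, h = image.size
    cols = min(max_tiles, max(1, round(w / TILE)))
    rows = min(max(1, max_tiles // cols), max(1, round(h / TILE)))
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
    thumbnail = image.resize((TILE, TILE))  # global view of the whole image
    return tiles + [thumbnail]
```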
3. Language Model:
- Used Qwen2-72B-Instruct as the base LLM
4. Training Data Curation:
- Carefully curated high-quality pretraining and supervised fine-tuning datasets
- Included diverse task-oriented datasets for various capabilities
5. Pretraining:
- Froze LLM and vision encoder
- Trained only modality-alignment modules (e.g., MLP projector, cross-attention layers)
- Used a large batch size of 2048
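In PyTorch terms, the stage-1 setup boils down to something like the sketch below. Module handles and hyperparameters are placeholders; the actual training stack behind an effective batch size of 2048 is of course far more involved.

```python
import torch

def configure_pretraining(vision_encoder, llm, projector, lr=1e-4):
    """Stage-1 modality alignment: freeze the vision encoder and the LLM,
    and give gradients only to the alignment module (MLP projector for
    NVLM-D, cross-attention layers for NVLM-X)."""
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in llm.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    # Only the trainable alignment parameters go to the optimizer.
    return torch.optim.AdamW(projector.parameters(), lr=lr)
```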
6. Supervised Fine-Tuning (SFT):
- Unfroze LLM while keeping the vision encoder frozen
- Trained on multimodal SFT datasets and high-quality text-only SFT data
- Implemented 1-D tile tagging for dynamic high-resolution inputs
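The 1-D tile tagging idea is simple enough to sketch: each dynamic high-resolution tile's image tokens are preceded by a plain-text tag telling the LLM which tile they came from. The tag strings and ordering below are assumptions for illustration, not necessarily the exact ones used in NVLM.

```python
def tag_tiles(tile_token_ids: list[list[int]], tokenizer) -> list[int]:
    """Interleave text tile tags with image-token blocks: <tile_1> ... <tile_k>
    for the high-resolution tiles, plus a separate tag for the global thumbnail
    (assumed here to be the last entry in tile_token_ids)."""
    sequence: list[int] = []
    for i, tile_ids in enumerate(tile_token_ids, start=1):
        tag = "<tile_global>" if i == len(tile_token_ids) else f"<tile_{i}>"
        sequence.extend(tokenizer.encode(tag, add_special_tokens=False))
        sequence.extend(tile_ids)
    return sequence
```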
7. Evaluation:
- Evaluated on multiple vision-language benchmarks
- Compared performance to leading proprietary and open-source models
8. Optimization:
- Iterated on model designs and training approaches
- Used smaller 34B models for faster experimentation before scaling to 72B
9. Now comes the best part... Open-Sourcing:
- Released model weights and full technical details to the research community
The paper provides fascinating insights into architecture design, training data curation, and achieving production-grade multimodality. A must-read for anyone working on multimodal AI!
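If you want to poke at the released checkpoints yourself, loading should look roughly like this. I'm assuming the repo id nvidia/NVLM-D-72B and the usual trust_remote_code recipe; check the model card for the authoritative instructions.

```python
from transformers import AutoModel, AutoTokenizer

repo = "nvidia/NVLM-D-72B"  # assumed repo id; see the model card
model = AutoModel.from_pretrained(repo, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```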