Guillaume THIBAULT

GuillaumeTHIBAULT

AI & ML interests

None yet

Organizations

EVEIL

GuillaumeTHIBAULT's activity

reacted to singhsidhukuldeep's post with 👍 about 1 month ago
The good folks at @nvidia have just released NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results across vision-language tasks.

Here is how they did it:

1. Model Architecture Design:
- Developed three model architectures:
a) NVLM-D: Decoder-only architecture
b) NVLM-X: Cross-attention-based architecture
c) NVLM-H: Novel hybrid architecture
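
In rough terms, the three designs differ in where image features enter the LLM: the decoder-only variant projects image tokens and splices them into the text sequence, the cross-attention variant keeps them outside the sequence and attends to them through gated cross-attention, and the hybrid mixes the two. Here is a minimal PyTorch sketch of those two fusion styles; module names and dimensions are made up for illustration and are not the released implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only; the real models are far larger.
D_TXT, D_IMG, N_HEADS = 1024, 1024, 8

class DecoderOnlyFusion(nn.Module):
    """NVLM-D style: project image tokens and splice them into the text sequence."""
    def __init__(self):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(D_IMG, D_TXT), nn.GELU(), nn.Linear(D_TXT, D_TXT)
        )

    def forward(self, text_emb, image_tokens):
        # text_emb: (B, T_txt, D_TXT); image_tokens: (B, T_img, D_IMG)
        img_emb = self.projector(image_tokens)
        # The concatenated sequence is fed to an ordinary decoder-only LLM.
        return torch.cat([img_emb, text_emb], dim=1)

class CrossAttentionFusion(nn.Module):
    """NVLM-X style: text hidden states attend to image tokens via gated cross-attention."""
    def __init__(self):
        super().__init__()
        self.xattn = nn.MultiheadAttention(
            D_TXT, N_HEADS, kdim=D_IMG, vdim=D_IMG, batch_first=True
        )
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed, learned during training

    def forward(self, text_hidden, image_tokens):
        attended, _ = self.xattn(text_hidden, image_tokens, image_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```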

2. Vision Encoder:
- Used InternViT-6B-448px-V1-5 as the vision encoder
- Implemented dynamic high-resolution (DHR) input handling
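
Dynamic high-resolution handling roughly means slicing the full image into 448x448 tiles that the fixed-resolution encoder can process, plus a downscaled thumbnail that preserves the global view. A hedged sketch of that tiling step (not the paper's exact preprocessing; the tile-count caps are assumptions):

```python
from PIL import Image

TILE = 448  # InternViT-6B-448px-V1-5 consumes fixed 448x448 inputs

def dynamic_tiles(image: Image.Image, max_cols: int = 3, max_rows: int = 2):
    """Split an image into TILE x TILE crops plus a global thumbnail (illustrative only)."""
    w, h = image.size
    cols = max(1, min(max_cols, -(-w // TILE)))   # ceil(w / TILE), capped
    rows = max(1, min(max_rows, -(-h // TILE)))   # ceil(h / TILE), capped
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = image.resize((TILE, TILE))  # low-resolution view of the whole image
    return tiles + [thumbnail]
```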

3. Language Model:
- Used Qwen2-72B-Instruct as the base LLM

4. Training Data Curation:
- Carefully curated high-quality pretraining and supervised fine-tuning datasets
- Included diverse task-oriented datasets for various capabilities

5. Pretraining:
- Froze LLM and vision encoder
- Trained only modality-alignment modules (e.g., MLP projector, cross-attention layers)
- Used a large batch size of 2048
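
In PyTorch terms, this stage amounts to switching off gradients everywhere except the alignment modules. A minimal sketch under assumed attribute names (model.projector is a placeholder, not the released API); the large batch would be handled by whatever data-parallel setup is in use:

```python
import torch

def configure_pretraining(model, lr=1e-4):
    """Stage-1 style setup: freeze everything, then re-enable only the alignment modules."""
    for p in model.parameters():
        p.requires_grad = False             # LLM and vision encoder stay frozen
    for p in model.projector.parameters():  # e.g. MLP projector / cross-attention layers
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```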

6. Supervised Fine-Tuning (SFT):
- Unfroze LLM while keeping the vision encoder frozen
- Trained on multimodal SFT datasets and high-quality text-only SFT data
- Implemented 1-D tile tagging for dynamic high-resolution inputs
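
During SFT the trainable set changes (LLM unfrozen, vision encoder still frozen), and each dynamic-resolution tile's features get a text tag in front of them so the model knows which crop they came from. A hedged sketch of both pieces; attribute and tag names are illustrative, not the paper's exact tokens:

```python
def configure_sft(model):
    """SFT-style setup: LLM trainable, vision encoder kept frozen (attribute names assumed)."""
    for p in model.llm.parameters():
        p.requires_grad = True
    for p in model.vision_encoder.parameters():
        p.requires_grad = False

def tag_tiles(tile_features):
    """1-D tile tagging (illustrative): prefix each tile's features with a text tag,
    e.g. <tile_1> [tile 1 tokens] <tile_2> [tile 2 tokens] ... ahead of the question text."""
    return [(f"<tile_{i}>", feats) for i, feats in enumerate(tile_features, start=1)]
```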

7. Evaluation:
- Evaluated on multiple vision-language benchmarks
- Compared performance to leading proprietary and open-source models

8. Optimization:
- Iterated on model designs and training approaches
- Used smaller 34B models for faster experimentation before scaling to 72B

9. Now comes the best part... Open-Sourcing:
- Released model weights and full technical details to the research community
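
If you want to try the released weights, the decoder-only checkpoint is published on the Hugging Face Hub as nvidia/NVLM-D-72B and loads through the usual trust_remote_code path. An untested sketch; check the model card for the exact image preprocessing and generation calls:

```python
from transformers import AutoModel, AutoTokenizer

# Untested sketch; see the model card for the exact preprocessing and chat template.
MODEL_ID = "nvidia/NVLM-D-72B"  # decoder-only NVLM 1.0 checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",      # a 72B model needs multiple GPUs or CPU offloading
    device_map="auto",       # requires the accelerate package
    trust_remote_code=True,  # the modeling code ships with the checkpoint
)
```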

The paper provides fascinating insights into architecture design, training data curation, and achieving production-grade multimodality. A must-read for anyone working on multimodal AI!