James Le

khanhnamle1994

https://jameskle.com/

AI & ML interests

Multimodal AI, Video Understanding



khanhnamle1994's activity

upvoted a paper 3 months ago

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published Dec 13, 2024 • 140

upvoted a collection 3 months ago

Nucleotide Transformer

Collection

13 items • Updated Sep 12, 2024 • 14

reacted to merve's post with 👍 6 months ago

Post

2394

NVIDIA just dropped NVEagle 🦅

Super impressive vision language model that comes in 7B, 13B, and 13B fine-tuned on chat 💬
Model repositories: https://huggingface.co/collections/merve/nveagle-66d0705108582d73bb235c26
Try it: https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat 💬 (works very well! 🤯)

This model essentially explores having different experts (MoE) for the image encoder part of a vision language model.
How? 🧐
The authors concatenate the vision encoders' output tokens and apply "pre-alignment", which essentially fine-tunes the experts with a frozen text encoder.

Then they freeze both the experts and the decoder and train just the projection layer, and finally they unfreeze everything for supervised fine-tuning ✨

In the paper, they explore different fusion strategies and vision encoders, extending the basic CLIP encoder, and find that simply concatenating visual tokens works well.
The rest of the architecture is quite similar to LLaVA.
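
The recipe described in the post can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' code: the component names (`vision_experts`, `projector`, `language_model`), the stage names, and both helper functions are hypothetical. `language_model` stands in for what the post calls the text encoder / decoder. Two things are assumptions rather than statements from the post: that tokens are concatenated along the feature axis (which requires every expert to emit the same number of tokens), and that the projector is trainable during pre-alignment.

```python
from typing import Dict, FrozenSet, List, Set

Token = List[float]  # one visual token, represented as a plain feature vector

# Hypothetical names for the three parts of the model mentioned in the post.
COMPONENTS = ("vision_experts", "projector", "language_model")

# Which components are frozen in each stage, following the post's description.
# Assumption: anything the post does not name as frozen is treated as trainable.
FROZEN_BY_STAGE: Dict[str, FrozenSet[str]] = {
    "pre_alignment": frozenset({"language_model"}),
    "projector_only": frozenset({"vision_experts", "language_model"}),
    "supervised_fine_tuning": frozenset(),
}


def trainable_in(stage: str) -> Set[str]:
    """Return the components that receive gradient updates in a stage."""
    return set(COMPONENTS) - FROZEN_BY_STAGE[stage]


def concat_visual_tokens(expert_outputs: List[List[Token]]) -> List[Token]:
    """Fuse per-expert token sequences by concatenating features token by token.

    expert_outputs[e][i] is the feature vector of token i from expert e.
    """
    if not expert_outputs:
        raise ValueError("need at least one expert output")
    num_tokens = len(expert_outputs[0])
    if any(len(out) != num_tokens for out in expert_outputs):
        raise ValueError("all experts must emit the same number of tokens")
    return [
        [value for out in expert_outputs for value in out[i]]
        for i in range(num_tokens)
    ]
```

With two experts that each emit two tokens of widths 2 and 3, `concat_visual_tokens` returns two fused tokens of width 5, and `trainable_in("projector_only")` contains only the projector.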

upvoted a paper 7 months ago

TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Paper • 2408.11318 • Published Aug 21, 2024 • 56

upvoted a paper 10 months ago

Pegasus-v1 Technical Report

Paper • 2404.14687 • Published Apr 23, 2024 • 32

authored a paper 11 months ago

Pegasus-v1 Technical Report

Paper • 2404.14687 • Published Apr 23, 2024 • 32