
Merve Noyan

merve

https://github.com/merveenoyan/smol-vision

AI & ML interests

VLMs, vision & co

merve's activity

posted an update 3 days ago

Post

3341

supercharge your LLM apps with smolagents 🔥

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!

Here's our blog for you to get started https://huggingface.co/blog/smolagents
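To make "the LLM writes code" concrete, here is a minimal, dependency-free sketch of a single code-agent step. It does not use the smolagents API: `fake_llm` and `run_code_action` are hypothetical names made up for illustration, and the "model" is a stub that returns a fixed snippet. A real agent would generate the code from the task and run it in a sandbox.

```python
from typing import Callable, Dict


def fake_llm(task: str) -> str:
    """Hypothetical stand-in for an LLM: its 'action' is a piece of Python code."""
    # A real model would generate this from the task; here it is hard-coded.
    return "result = sum(n * n for n in numbers)"


def run_code_action(task: str, llm: Callable[[str], str], tools: Dict[str, object]) -> object:
    """Ask the model for code, run it with only the given tools in scope,
    and return whatever the code stored in `result`."""
    code = llm(task)
    # Expose only an allow-listed builtin plus the caller's tools.
    namespace = {"__builtins__": {"sum": sum}, **tools}
    # NOTE: exec on model output is unsafe outside a sandbox; illustration only.
    exec(code, namespace)
    return namespace["result"]


answer = run_code_action(
    "Sum the squares of the numbers", fake_llm, {"numbers": [1, 2, 3]}
)
print(answer)
```

The design point is that the action is code rather than a fixed tool-call schema, so one step can combine several operations (here a loop plus a sum) over the tools it was handed.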

upvoted a collection 3 days ago

QVQ

Collection

QVQ: Qwen models for visual reasoning • 7 items • Updated 2 days ago • 33

posted an update 10 days ago

Post

4063

QwQ can see 🔥
Qwen team released QvQ, a large vision LM with reasoning 😱

it outperforms proprietary VLMs on several benchmarks, comes with open weights and a demo!
Check them out ⬇️
Demo https://huggingface.co/spaces/Qwen/QVQ-72B-preview
Model https://huggingface.co/Qwen/QVQ-72B-Preview
Read more https://qwenlm.github.io/blog/qvq-72b-preview/
Congratulations @JustinLin610 and team!

2 replies

reacted to AdinaY's post with 🔥 10 days ago

Post

2859

QvQ-72B-Preview🎄 an open weight model for visual reasoning just released by Alibaba_Qwen team
Qwen/qvq-676448c820912236342b9888
✨ Combines visual understanding & language reasoning.
✨ Scores 70.3 on MMMU
✨ Outperforms Qwen2-VL-72B-Instruct in complex problem-solving

updated a model 14 days ago

merve/colpali_ufo

Updated 14 days ago • 2

updated a model 15 days ago

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • Updated Dec 2, 2024 • 66.5k • 307

New activity in HuggingFaceTB/SmolVLM-Instruct 15 days ago

Add FT tutorial link

#22 opened 15 days ago by

merve

How to train or fine-tune SmolVLM easily?

#21 opened 19 days ago by

lucasjin

posted an update 16 days ago

Post

2722

Aya by Cohere For AI can now see! 👀

The C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 it works in 8 languages! 🗣️

The authors extend the LLaVA dataset using Aya's translation capabilities with 558k examples!
Try it here kkr5155/maya_demo

Dataset maya-multimodal/pretrain

Model maya-multimodal/maya 👏
kudos @nahidalam and team

1 reply

upvoted a paper 16 days ago

Maya: An Instruction Finetuned Multilingual Multimodal Model

Paper • 2412.07112 • Published 24 days ago • 25

New activity in merve/paligemma_vqav2 16 days ago

Update `dataset` to reference to the actual dataset used

#4 opened 16 days ago by

alvarobartt

posted an update 17 days ago

Post

3149

Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B model outperforms most 30B models 🧶

✨ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with an Apache 2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work ⏯️

Try the demo for the best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions that work for small models also hold up when the model and dataset are scaled 📈 scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal
> They also compare image encoders, trying a variety of models from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models: training all stages with some parts frozen gives the best results
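The difference between the two frame sampling strategies mentioned above can be sketched in a few lines. This is not the Apollo authors' code: the function names, the 30 fps source rate and the 2 fps target are assumptions chosen for illustration. Uniform sampling always returns the same number of frames, so the time gap between frames grows with video length; FPS sampling keeps the time gap fixed, so the frame count grows instead.

```python
from typing import List


def uniform_sample(num_frames: int, num_samples: int) -> List[int]:
    """Pick `num_samples` frame indices spread evenly over the whole video."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def fps_sample(num_frames: int, video_fps: float, target_fps: float) -> List[int]:
    """Pick frame indices at a fixed rate of `target_fps` frames per second."""
    stride = video_fps / target_fps  # source frames between two sampled frames
    indices = []
    i = 0
    while int(i * stride) < num_frames:
        indices.append(int(i * stride))
        i += 1
    return indices


# A 10 s clip and a 60 s clip, both assumed to be recorded at 30 fps.
short_clip, long_clip = 300, 1800

uniform_short = uniform_sample(short_clip, 8)
uniform_long = uniform_sample(long_clip, 8)
fps_short = fps_sample(short_clip, video_fps=30.0, target_fps=2.0)
fps_long = fps_sample(long_clip, video_fps=30.0, target_fps=2.0)

print(len(uniform_short), len(uniform_long))  # same count, different spacing
print(len(fps_short), len(fps_long))          # same spacing, different count
```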

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms most 30B models 🔥

6 replies

upvoted a paper 17 days ago

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published 21 days ago • 135

posted an update 22 days ago

Post

1734

A complete RAG pipeline includes a reranker, which re-scores the retrieved documents to find the best one 📓
The same goes for multimodal RAG: there are multimodal rerankers that we can integrate into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as retriever, lightonai/MonoQwen2-VL-v0.1 as reranker, Qwen/Qwen2-VL-7B-Instruct as VLM in this notebook that runs on a GPU as small as L4 🔥 https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
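The retrieve-then-rerank pattern itself is easy to sketch without any models. In the toy example below, `word_overlap` and `phrase_match` are hypothetical stand-ins for the retriever and the reranker (in the notebook those roles are played by the ColQwen2 and MonoQwen2-VL models), and the documents are made-up strings rather than page images.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    retriever_score: Scorer,
    reranker_score: Scorer,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Two-stage ranking: a cheap retriever narrows the pool,
    then a costlier reranker orders the survivors."""
    # Stage 1: score every document with the retriever and keep the top_k.
    candidates = sorted(docs, key=lambda d: retriever_score(query, d), reverse=True)[:top_k]
    # Stage 2: re-score only the candidates with the reranker.
    scored = [(d, reranker_score(query, d)) for d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def word_overlap(query: str, doc: str) -> float:
    """Toy retriever: number of lowercase words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Toy reranker: 1.0 if the whole query appears as a phrase, else 0.0."""
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "chart of revenue by region",
    "quarterly revenue chart 2023",
    "team offsite photos",
    "chart legend colors",
]

ranked = retrieve_then_rerank("revenue chart", docs, word_overlap, phrase_match)
best_doc = ranked[0][0]
print(best_doc)
```

The retriever ties on the first two documents, and the reranker breaks the tie. The best document would then be passed to the VLM together with the query.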

New activity in merve/vision_papers 25 days ago

Fix streamlit warning

#3 opened 25 days ago by

lbourdois

updated a collection 25 days ago

Dec 6 Releases 🎄

Collection

28 items • Updated 25 days ago • 10

Articles

Introducing smolagents: simple agents that write actions in code.

Welcome PaliGemma 2 – New vision language models by Google

SmolVLM - small yet mighty Vision Language Model

Llama can now see and run on your device - welcome Llama 3.2

Preference Optimization for Vision Language Models

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

PaliGemma – Google's Cutting-Edge Open Vision Language Model

Vision Language Models Explained

Introduction to Quantization cooked in 🤗 with 💗🧑‍🍳

Deploy MusicGen in no time with Inference Endpoints

Open-Source Text Generation & LLM Ecosystem at Hugging Face

Jupyter X Hugging Face

Using Machine Learning to Aid Survivors and Race through Time

Introducing Skops

Announcing the Hugging Face Fellowship Program

Hosting your Models and Datasets on Hugging Face Spaces using Streamlit

Showcase Your Projects in Spaces using Gradio
