AI & ML interests

Collection of JS libraries to interact with the Hugging Face Hub

Recent Activity

huggingfacejs's activity

merve 
posted an update 7 days ago
Xenova 
posted an update 13 days ago
merve 
posted an update 13 days ago
Aya by Cohere For AI can now see! 👀

C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 works on 8 languages! 🗣️

The authors extend Llava dataset using Aya's translation capabilities with 558k examples!
Try it here kkr5155/maya_demo

Dataset maya-multimodal/pretrain

Model maya-multimodal/maya 👏
kudos @nahidalam and team
merve 
posted an update 14 days ago
Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B model outperforms most 30B models 🧶

✨ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with Apache 2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work ⏯️

Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
they evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that whatever design decisions were applied to small models also scale properly when the model and dataset are scaled 📈 scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal
> They also compare image encoders, trying a range of models from shape-optimized SigLIP to DINOv2
they find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models; training all stages with some frozen parts gives the best yield

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo 7B outperforms 30B models 🔥
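
To make the frame sampling comparison above concrete, here is a minimal sketch of the two strategies. It is not the Apollo code: the function names, the "middle of each segment" rule for uniform sampling, and the example frame rates are all assumptions chosen for illustration. The point it tries to show is that uniform sampling keeps a fixed number of frames no matter how long the video is, while FPS sampling keeps a fixed number of frames per second.

```python
def uniform_sample(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the whole clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the frame in the middle of each equal-length segment (assumed rule).
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_sample(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at a fixed rate of target_fps frames per second."""
    if total_frames <= 0 or video_fps <= 0 or target_fps <= 0:
        return []
    # Never step by less than one frame, even if target_fps > video_fps.
    stride = max(video_fps / target_fps, 1.0)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += stride
    return indices


# A 10 s clip and a 30 s clip, both at 30 FPS (hypothetical values).
short_uniform = uniform_sample(300, 4)
long_uniform = uniform_sample(900, 4)
short_fps = fps_sample(300, 30.0, 2.0)
long_fps = fps_sample(900, 30.0, 2.0)
print(len(short_uniform), len(long_uniform))  # same frame count for both clips
print(len(short_fps), len(long_fps))          # frame count grows with duration
```

With uniform sampling the gap between sampled frames grows as the video gets longer, whereas FPS sampling keeps the time between frames constant and lets the number of frames (and visual tokens) grow instead.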
merve 
posted an update 19 days ago
A complete RAG pipeline includes a reranker, which ranks the retrieved documents to find the best one 📓
The same goes for multimodal RAG: there are multimodal rerankers that we can integrate into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as retriever, lightonai/MonoQwen2-VL-v0.1 as reranker, Qwen/Qwen2-VL-7B-Instruct as VLM in this notebook that runs on a GPU as small as L4 🔥 https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
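
The retrieve-then-rerank flow described above can be sketched without any model at all. This is only a toy illustration, not the notebook's code: `retrieve_then_rerank`, `word_overlap` and `overlap_ratio` are hypothetical names, and the two scoring functions are simple stand-ins for a real retriever (such as ColQwen2) and a real reranker (such as MonoQwen2-VL).

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    documents: list[str],
    retriever_score: Scorer,
    reranker_score: Scorer,
    top_k: int = 3,
) -> list[str]:
    """Two-stage ranking: a cheap retriever narrows down, a reranker reorders."""
    # Stage 1: score every document with the retriever and keep the top_k.
    candidates = sorted(
        documents, key=lambda doc: retriever_score(query, doc), reverse=True
    )[:top_k]
    # Stage 2: rescore only the survivors with the (usually costlier) reranker.
    return sorted(
        candidates, key=lambda doc: reranker_score(query, doc), reverse=True
    )


def word_overlap(query: str, doc: str) -> float:
    """Toy retriever: number of words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy reranker: shared words relative to document length."""
    return word_overlap(query, doc) / max(len(doc.split()), 1)


docs = [
    "march weather report with invoice mention and many other unrelated words",
    "invoice total for march",
    "holiday photos",
]
ranked = retrieve_then_rerank("invoice march", docs, word_overlap, overlap_ratio, top_k=2)
best_document = ranked[0]  # this is what would be handed to the VLM
print(best_document)
```

In a real pipeline the top reranked page would be passed to the VLM together with the question; that generation step is left out here.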
julien-c 
posted an update 21 days ago
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
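
As a quick illustration of the free tiers in the TL;DR above, the sketch below computes how much private storage would sit above the free tier. It is an assumption-laden example, not an official calculator: the function name is hypothetical, 1TB is treated as 1000 GB, and no prices are modeled. The docs linked below remain the reference.

```python
# Free private storage per account type, in GB.
# Assumption: 1 TB is counted as 1000 GB here.
FREE_PRIVATE_TIER_GB = {True: 1000.0, False: 100.0}


def private_storage_over_free_tier(used_gb: float, has_paid_account: bool) -> float:
    """Return how many GB of private storage exceed the free tier (never negative)."""
    free_gb = FREE_PRIVATE_TIER_GB[has_paid_account]
    return max(used_gb - free_gb, 0.0)


print(private_storage_over_free_tier(250.0, has_paid_account=False))
print(private_storage_over_free_tier(250.0, has_paid_account=True))
```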

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
Xenova 
posted an update 23 days ago
Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. 🤗 Try it out yourself!

Demo: webml-community/text-to-speech-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/text-to-speech-webgpu
Model: onnx-community/OuteTTS-0.2-500M (ONNX), OuteAI/OuteTTS-0.2-500M (PyTorch)
merve 
posted an update 23 days ago
This week in open-source AI was insane 🤠 A small recap🕺🏻 merve/dec-6-releases-67545caebe9fc4776faac0a3

Multimodal 🖼️
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants 👏
> OpenGVLab released InternVL2, seven new vision LMs in different sizes, with a sota checkpoint under MIT license ✨
> Qwen team at Alibaba released the base models of Qwen2VL with 2B, 7B and 72B ckpts

LLMs 💬
> Meta released a new iteration of Llama 70B, Llama3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license 🔥
> Dataset: CohereForAI released GlobalMMLU, multilingual version of MMLU with 42 languages with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with multilinguality update! 🔥 nearly 8TB pretraining data in many languages!

Image/Video Generation 🖼️
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux

Audio 🔊
> Indic-Parler-TTS is a new text2speech model made by the community
merve 
posted an update 24 days ago
merve 
posted an update 29 days ago
small but mighty 🔥
you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM 🫰🏻 with gradient accumulation, the simulated batch size is 16 ✨
I made a notebook that includes all the goodies: QLoRA, gradient accumulation, gradient checkpointing with explanations on how they work 💝 https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
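
To show why a batch size of 4 can behave like a batch size of 16, here is a small pure-Python sketch of gradient accumulation on a one-parameter linear model. It is not taken from the notebook: the function names, the toy data and the squared-error loss are assumptions, and the 4 accumulation steps are simply derived from the numbers above (16 / 4).

```python
def mean_gradient(weight: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean((weight * x - y) ** 2) with respect to weight."""
    return sum(2 * x * (weight * x - y) for x, y in batch) / len(batch)


def accumulated_gradient(
    weight: float, data: list[tuple[float, float]], micro_batch_size: int
) -> tuple[float, int]:
    """Average per-micro-batch gradients instead of computing one big batch."""
    micro_batches = [
        data[i : i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
    ]
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient by 1/steps before adding it, so the
        # sum equals the mean gradient over all samples (equal-size batches).
        total += mean_gradient(weight, micro_batch) / steps
    return total, steps


# 16 toy samples following y = 3x; only 4 are "in memory" at any time.
data = [(float(i), 3.0 * i) for i in range(1, 17)]
full_batch_grad = mean_gradient(1.0, data)
accum_grad, accumulation_steps = accumulated_gradient(1.0, data, micro_batch_size=4)
print(accumulation_steps, full_batch_grad, accum_grad)
```

The accumulated gradient matches the full-batch gradient, while peak memory only has to hold one micro-batch, which is the trade that keeps fine-tuning within the VRAM budget.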
merve 
posted an update 29 days ago
Last week we were blessed with open-source models! A recap 💝
merve/nov-29-releases-674ccc255a57baf97b1e2d31

🖼️ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model 💗
> Show Lab released ShowUI-2B: new vision-language-action model to build GUI/web automation agents 🤖
> Rhymes AI has released the base models of Aria: Aria-Base-64K and Aria-Base-8K with their respective context lengths
> ViDoRe team released ColSmolVLM: A new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: new dataset labelled using Llava-CoT multimodal reasoning model📖
> Dataset: LLaVA-CoT-100k dataset used to train Llava-CoT released by creators of Llava-CoT 📕

💬 LLMs
> Qwen team released QwQ-32B-Preview, state-of-the-art open-source reasoning model, broke the internet 🔥
> Alibaba has released Marco-o1, a new open-source reasoning model 💥
> NVIDIA released Hymba 1.5B Base and Instruct, the new state-of-the-art SLMs with hybrid architecture (Mamba + transformer)

⏯️ Image/Video Generation
> Qwen2VL-Flux: new image generation model based on Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 res ⏯️
> Dataset: Image Preferences is a new image generation preference dataset made with DIBT community effort of Argilla 🏷️

Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B trained on 5B audio prompt tokens
julien-c 
posted an update about 1 month ago
wow 😮

INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.

PrimeIntellect/INTELLECT-1-Instruct
Xenova 
posted an update about 1 month ago
We just released Transformers.js v3.1 and you're not going to believe what's now possible in the browser w/ WebGPU! 🤯 Let's take a look:
🔀 Janus from Deepseek for unified multimodal understanding and generation (Text-to-Image and Image-Text-to-Text)
👁️ Qwen2-VL from Qwen for dynamic-resolution image understanding
🔢 JinaCLIP from Jina AI for general-purpose multilingual multimodal embeddings
🌋 LLaVA-OneVision from ByteDance for Image-Text-to-Text generation
🤸‍♀️ ViTPose for pose estimation
📄 MGP-STR for optical character recognition (OCR)
📈 PatchTST & PatchTSMixer for time series forecasting

That's right, everything running 100% locally in your browser (no data sent to a server)! 🔥 Huge for privacy!

Check out the release notes for more information. 👇
https://github.com/huggingface/transformers.js/releases/tag/3.1.0

Demo link (+ source code): webml-community/Janus-1.3B-WebGPU
merve 
posted an update about 1 month ago
The authors of ColPali trained a retrieval model based on SmolVLM 🤠 vidore/colsmolvlm-alpha
TLDR;

- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks

- ColSmolVLM is more memory efficient than ColQwen2 💗
merve 
posted an update about 1 month ago
Small yet mighty! 💫

We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, immensely memory efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39

Learn more from our blog here: huggingface.co/blog/smolvlm
This release comes with a demo, fine-tuning code, MLX integration and TRL integration for DPO 💝
Try the demo: HuggingFaceTB/SmolVLM
Fine-tuning Recipe: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
Also TRL integration for DPO 💗
coyotte508 
updated a Space about 1 month ago
merve 
posted an update about 1 month ago
What a week! A recap for everything you missed ❄️
merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released, a multimodal reproduction of o1 model by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina has released Jina-CLIP-v2 0.98B multilingual multimodal embeddings
> Apple released new SotA vision encoders AIMv2

LLMs 🦙
> AllenAI dropped a huge release of models, datasets and scripts for Tülu, a family of models based on Llama 3.1 aligned with SFT, DPO and a new technique they have developed called RLVR
> Jina has released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: synthetic dataset used to align SmolLM2 using supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs

Image Generation 🖼️
> Black Forest Labs released FLUX.1 Tools: four new models for different image modifications and two LoRAs to do image conditioning and better steer generations

Lastly, Hugging Face released a new library, Observers: a lightweight SDK for monitoring interactions with AI APIs and easily storing and browsing them 📚
$ pip install observers
merve 
posted an update about 1 month ago
Apple released AIMv2 🍏 a family of state-of-the-art open-set vision encoders
apple/aimv2-6720fe1558d94c7805f7688c
> like CLIP, but with an added decoder and trained autoregressively 🤯
> 19 open models come in 300M, 600M, 1.2B, 2.7B with resolutions of 224, 336, 448
> Load and use with 🤗 transformers
merve 
posted an update about 1 month ago
your hugging face profile now has your recent activities 🤗
Xenova 
posted an update about 1 month ago
Have you tried out 🤗 Transformers.js v3? Here are the new features:
⚡ WebGPU support (up to 100x faster than WASM)
🔢 New quantization formats (dtypes)
🏛 120 supported architectures in total
📂 25 new example projects and templates
🤖 Over 1200 pre-converted models
🌐 Node.js (ESM + CJS), Deno, and Bun compatibility
🏡 A new home on GitHub and NPM

Get started with npm i @huggingface/transformers.

Learn more in our blog post: https://huggingface.co/blog/transformersjs-v3