Vaibhav Srivastav PRO

reach-vb

AI & ML interests

TTS + LM performance prediction

Organizations

Hugging Face, Notebooks-explorers, Whisper fine-tuning sprint, Hugging Face Course, Whisper Fine-Tuning Event, Kensho, Mozilla Foundation, PolinaOrg, Coqui.ai, Internal Data & Models for Speech Recognition Event, Speech Recognition Community Event Version 2, onnx, Hugging Test Lab, Internal Data, The Team Ten, Huggingface Projects, EuroPython 2022, Whisper Distillation, BigCode, Hugging Face OSS Metrics, Harmonai's Dance Diffusion Community, EuroSciPy 2022, LaLoka Labs, Core ML Projects, meta-private, Blog-explorers, Music Gen Sprint, Hugging Face for Audio, Hugging Face TB Research, Open ASR Leaderboard, test, MusicGen Internal, TTS Eval (OLD), ZeroGPU Explorers, Editing Audio, ggml.ai, LocalLLaMA, gg-hf, Unofficial Mistral Community, Journalists on Hugging Face, Llzama, finding-nemo, diarizers-community, MLX Community, Cartesia, IBM Granite, On-device Squad, TTS AGI, Social Post Explorers, Apple CoreNet Models, LM Studio Community, gg-gguf, hsramall, Parler TTS, ibm-ai-platform, Lina Speech, Dev Mode Explorers, Sweet Dream(Booth)s, private beta for deeplinks, Paris AI Running Club, gg-tt, Kyutai, OuteAI, Hugging Face Discord Community, LLHF, SLLHF, Ratchet Community, Hugging Quants, lbhf, CoreML Scratchpad, blhf, Meta Llama, kmhf, nltpt, nltpt-q, ai4b-hf, Ollama Tools, Spirit LM, qrias, Audio Collabs, Consumer AI Edge Hackathon (Meta, Hugging Face, Pytorch, Scaleway & Unaite), open/ acc, ExecuTorch Community, wut?, DDUF, AI Starter Pack

reach-vb's activity

reacted to julien-c's post with 👍 12 days ago
After some heated discussion 🔥, we're clarifying our intent regarding storage limits on the Hub

TL;DR:
- public storage is free and (barring blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
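
To make the two tiers concrete, here is a minimal Python sketch of the arithmetic. The helper `billable_private_gb` is hypothetical (not a Hugging Face API), and it assumes 1TB is counted as 1000GB; the storage-limits docs are the authority on exact units and pricing.

```python
# Hypothetical helper, not a Hugging Face API: estimates how much private
# storage exceeds the free tier described in the post.
PAID_ACCOUNT_FREE_GB = 1000.0  # assumption: "1TB" treated as 1000GB
DEFAULT_FREE_GB = 100.0

def billable_private_gb(private_used_gb: float, has_paid_account: bool) -> float:
    free_tier = PAID_ACCOUNT_FREE_GB if has_paid_account else DEFAULT_FREE_GB
    # Public storage never enters this calculation: it is free (barring abuse).
    return max(0.0, private_used_gb - free_tier)

print(billable_private_gb(250, has_paid_account=False))  # 150.0
print(billable_private_gb(250, has_paid_account=True))   # 0.0
```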

docs: https://huggingface.co/docs/hub/storage-limits

We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
replied to julien-c's post 12 days ago
reacted to julien-c's post with 🤗❤️🔥 12 days ago
posted an update 15 days ago
VLMs are going through quite an open revolution, AND at on-device-friendly sizes:

1. Google DeepMind w/ PaliGemma2 - 3B, 10B & 28B: google/paligemma-2-release-67500e1e1dbfdd4dee27ba48

2. OpenGVLab w/ InternVL 2.5 - 1B, 2B, 4B, 8B, 26B, 38B & 78B: https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c

3. Qwen w/ Qwen 2 VL - 2B, 7B & 72B: Qwen/qwen2-vl-66cee7455501d7126940800d

4. Microsoft w/ FlorenceVL - 3B & 8B: https://huggingface.co/jiuhai

5. Moondream2 w/ 0.5B: https://huggingface.co/vikhyatk/

What a time to be alive! 🔥
replied to Duskfallcrew's post 17 days ago
Hi @nyuuzyou - I'm VB, I work at HF. The team is working around the clock on putting together a setup that works for everyone.

In the meantime, I assure you that your models/datasets are safe and no hard limits are in place. We're working on it!

Your research/work is quite important to the community and Hugging Face, and always will be.

reacted to their post with ❤️ 27 days ago
Massive week for Open AI/ ML:

Mistral Pixtral & Instruct Large - ~123B, 128K context, multilingual, json + function calling & open weights
mistralai/Pixtral-Large-Instruct-2411
mistralai/Mistral-Large-Instruct-2411

Allen AI Tülu 70B & 8B - competitive with Claude 3.5 Haiku, beats all major open models like Llama 3.1 70B, Qwen 2.5 and Nemotron
allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5
allenai/tulu-3-datasets-673b8df14442393f7213f372

LLaVA-o1 - a VLM capable of spontaneous, systematic reasoning, similar to OpenAI o1; the 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision
Xkev/Llama-3.2V-11B-cot

Black Forest Labs Flux.1 Tools - four new state-of-the-art model checkpoints & 2 adapters for Fill, Depth, Canny & Redux, open weights
reach-vb/black-forest-labs-flux1-6743847bde9997dd26609817

Jina AI Jina CLIP v2 - general-purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 resolution, Matryoshka representations (1024 down to 64 dimensions)
jinaai/jina-clip-v2
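
Matryoshka representations mean a prefix of the full vector is itself a usable, lower-fidelity embedding. Below is a minimal Python sketch of the common way to consume them: keep the first `dim` components and L2-renormalise before computing cosine similarities. It assumes the embedding is already available as a plain list of floats; it does not call any Jina API, and `truncate_embedding` is a hypothetical helper.

```python
import math
import random

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the leading `dim` components of a Matryoshka embedding and L2-renormalise."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    # Renormalise so cosine similarity reduces to a dot product again.
    return [x / norm for x in prefix] if norm > 0 else prefix

# Stand-in for a 1024-d model output (random numbers, not a real embedding).
rng = random.Random(0)
full = [rng.gauss(0.0, 1.0) for _ in range(1024)]
small = truncate_embedding(full, 64)
print(len(small))                           # 64
print(round(sum(x * x for x in small), 6))  # 1.0
```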

Apple AIMv2 & CoreML MobileCLIP - large-scale vision encoders that outperform CLIP and SigLIP, plus CoreML-optimised MobileCLIP models
apple/aimv2-6720fe1558d94c7805f7688c
apple/coreml-mobileclip

A lot more was released, like OpenScholar (OpenScholar/openscholar-v1-67376a89f6a80f448da411a6), SmolTalk (HuggingFaceTB/smoltalk), Hymba (nvidia/hymba-673c35516c12c4b98b5e845f), the Open ASR Leaderboard (hf-audio/open_asr_leaderboard) and much more.

Can't wait for the next week! 🤗
posted an update 28 days ago
reacted to thomwolf's post with 🔥 28 days ago
reacted to loubnabnl's post with 🔥 28 days ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
posted an update about 1 month ago
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) code generation LLMs, with the 32B tackling giants like Gemini 1.5 Pro and Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - a SoTA general LLM fine-tuned from Qwen 2.5 72B that excels at chat + function calling/JSON/agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/8B models approaching GPT-4o level: pick any LLM and train an adapter, with Whisper as the audio encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3B by DeepSeek - next iteration of their unified multimodal LLM Janus, with rectified flow
deepseek-ai/JanusFlow-1.3B

Common Corpus by Pleias - 2,003,039,184,047 multilingual, commercially permissive, high-quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot; can't wait for the next week!

Let me know in the comments what I missed! 🤗
posted an update about 2 months ago
Smol TTS models are here! OuteTTS-0.1-350M - zero-shot voice cloning, built on the Llama architecture, CC-BY license! 🔥

> Pure language modeling approach to TTS
> Zero-shot voice cloning
> Llama architecture w/ audio tokens (WavTokenizer)
> BONUS: Works on-device w/ llama.cpp ⚡

Three-step approach to TTS:

> Audio tokenization using WavTokenizer (75 tokens per second)
> CTC forced alignment for word-to-audio token mapping
> Structured prompt creation w/ transcription, duration, audio tokens
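
The three steps can be sketched in Python as below. This is a toy under stated assumptions, not OuteTTS's actual code: word timestamps are assumed to come from an upstream CTC forced aligner, the codec token IDs are made up, and the `<text>`, `<dur>` and `<cN>` prompt tags are hypothetical placeholders for whatever format the model really uses. Only the 75-tokens-per-second rate comes from the post.

```python
from dataclasses import dataclass

TOKENS_PER_SECOND = 75  # WavTokenizer rate quoted in the post

@dataclass
class AlignedWord:
    text: str
    duration_s: float
    tokens: list[int]  # codec tokens covering this word

def to_token_index(seconds: float) -> int:
    # round() rather than int() so that e.g. 0.4 s maps cleanly to index 30
    return round(seconds * TOKENS_PER_SECOND)

def align_words(words: list[tuple[str, float, float]],
                audio_tokens: list[int]) -> list[AlignedWord]:
    """Step 2: map (word, start_s, end_s) timestamps onto the codec token stream."""
    return [
        AlignedWord(text, end - start,
                    audio_tokens[to_token_index(start):to_token_index(end)])
        for text, start, end in words
    ]

def build_prompt(aligned: list[AlignedWord]) -> str:
    """Step 3 (hypothetical layout): transcription, then word + duration + tokens."""
    transcript = " ".join(w.text for w in aligned)
    lines = [f"<text>{transcript}</text>"]
    for w in aligned:
        codes = "".join(f"<c{t}>" for t in w.tokens)
        lines.append(f"{w.text}<dur>{w.duration_s:.2f}</dur>{codes}")
    return "\n".join(lines)

# Step 1 stand-in: 0.8 s of audio -> 60 made-up codec tokens at 75 tokens/s.
audio_tokens = list(range(to_token_index(0.8)))
aligned = align_words([("hello", 0.0, 0.4), ("world", 0.4, 0.8)], audio_tokens)
print(len(audio_tokens), [len(w.tokens) for w in aligned])  # 60 [30, 30]
print(build_prompt(aligned).splitlines()[0])                # <text>hello world</text>
```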

The model is extremely impressive for 350M parameters! Kudos to the OuteAI team on such a brilliant feat - I'd love to see this applied to larger data and smarter backbones like SmolLM 🤗

Check out the models here: OuteAI/outetts-6728aa71a53a076e4ba4817c
posted an update about 2 months ago
Smol models ftw! AMD released AMD OLMo 1B - beats OpenELM and TinyLlama on MT-Bench and AlpacaEval - Apache 2.0 licensed 🔥

> Trained on 1.3 trillion tokens (Dolma 1.7) on 16 nodes, each with 4 MI250 GPUs

> Three checkpoints:

- AMD OLMo 1B: Pre-trained model
- AMD OLMo 1B SFT: Supervised fine-tuned on Tulu V2, OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets
- AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on UltraFeedback dataset
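
For readers unfamiliar with the DPO step, here is a minimal Python sketch of the standard per-pair DPO objective: it rewards the policy for raising the log-probability of the preferred (chosen) response, relative to a frozen reference model, more than it raises the rejected one. The inputs are assumed to be summed sequence log-probabilities, and `beta=0.1` is an illustrative default, not AMD's reported setting.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# Policy has shifted probability toward the chosen response: loss drops.
print(round(dpo_loss(-8.0, -12.0, -10.0, -12.0), 4))   # 0.5981
```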

Key Insights:
> Pre-trained with less than half the tokens of OLMo-1B
> Post-training steps include two-phase SFT and DPO alignment
> Data for SFT:
- Phase 1: Tulu V2
- Phase 2: OpenHermes-2.5, WebInstructSub, and Code-Feedback

> Model checkpoints on the Hub & integrated with Transformers ⚡️

Congratulations & kudos to AMD on a brilliant smol model release! 🤗

amd/amd-olmo-6723e7d04a49116d8ec95070
reacted to albertvillanova's post with 🔥❤️ about 2 months ago
🚀 Exciting update! You can now compare multiple models side-by-side with the Hugging Face Open LLM Comparator! 📊

open-llm-leaderboard/comparator

Dive into multi-model evaluations, pinpoint the best model for your needs, and explore insights across top open LLMs all in one place. Ready to level up your model comparison game?
posted an update 2 months ago
What a great day for Open Science! @AIatMeta released models, datasets, and code for many of its research artefacts! 🔥

1. Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. A new developer suite will be added to make it easier for developers to build with SAM 2.

Model checkpoints: reach-vb/sam-21-6702d40defe7611a8bafa881

2. Layer Skip: Inference code and fine-tuned checkpoints demonstrating a new method for enhancing LLM performance.

Model checkpoints: facebook/layerskip-666b25c50c8ae90e1965727a

3. SALSA: New code enables researchers to benchmark AI-based attacks to validate security for post-quantum cryptography.

Repo: https://github.com/facebookresearch/LWE-benchmarking

4. Meta Lingua: A lightweight and self-contained codebase designed to train language models at scale.

Repo: https://github.com/facebookresearch/lingua

5. Meta Open Materials: New open source models and the largest dataset to accelerate AI-driven discovery of new inorganic materials.

Model checkpoints: fairchem/OMAT24

6. MEXMA: A new research paper and code for our novel pre-trained cross-lingual sentence encoder covering 80 languages.

Model checkpoint: facebook/MEXMA

7. Self-Taught Evaluator: a new method for generating synthetic preference data to train reward models without relying on human annotations.

Model checkpoint: facebook/Self-taught-evaluator-llama3.1-70B

8. Meta Spirit LM: An open-source language model for seamless speech and text integration.

Repo: https://github.com/facebookresearch/spiritlm
posted an update 2 months ago
Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥

> WhisperSpeech X Llama 3.1 8B
> Trained on 50K hours of speech (7 languages)
> Continually trained for 45 hours on 10x A1000s
> MLS -> WhisperVQ tokens -> Llama 3.1
> Instruction tuned on 1.89M samples
> 70% speech, 20% transcription, 10% text
> Apache 2.0 licensed ⚡
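
Assuming the 70/20/10 split applies to the 1.89M instruction-tuning samples (the post does not spell this out, and 1.89M is itself rounded), the per-category counts work out as below. The weighted sampler `sample_batch_sources` is a hypothetical illustration of how such a mix is commonly realised per batch, not Homebrew's training code.

```python
import random
from collections import Counter

TOTAL_SAMPLES = 1_890_000
MIX_PERCENT = {"speech": 70, "transcription": 20, "text": 10}

# Integer arithmetic avoids floating-point rounding in the counts.
counts = {name: TOTAL_SAMPLES * pct // 100 for name, pct in MIX_PERCENT.items()}
print(counts)  # {'speech': 1323000, 'transcription': 378000, 'text': 189000}

def sample_batch_sources(batch_size: int, seed: int = 0) -> list[str]:
    """Hypothetical: draw which category each example in a batch comes from."""
    rng = random.Random(seed)
    return rng.choices(list(MIX_PERCENT), weights=list(MIX_PERCENT.values()), k=batch_size)

# Category frequencies approximate the 70/20/10 mix as the batch grows.
print(Counter(sample_batch_sources(1000)))
```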

Architecture:
> WhisperSpeech/ VQ for Semantic Tokens
> Llama 3.1 8B Instruct for Text backbone
> Early fusion (Chameleon)
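
Early fusion (as in Chameleon) means audio enters the LLM as discrete tokens in the same sequence as text. A minimal Python sketch of the token bookkeeping, assuming discrete WhisperVQ codes are appended after the Llama 3.1 text vocabulary; the 512-entry codebook size and the sound start/end marker IDs are hypothetical placeholders, not Ichigo's actual values.

```python
TEXT_VOCAB_SIZE = 128_256  # size of the Llama 3.1 tokenizer vocabulary
CODEBOOK_SIZE = 512        # hypothetical number of WhisperVQ codes

# New IDs are appended after the text vocabulary (hypothetical layout).
SOUND_START = TEXT_VOCAB_SIZE + CODEBOOK_SIZE
SOUND_END = SOUND_START + 1
EXTENDED_VOCAB_SIZE = SOUND_END + 1

def sound_code_to_token_id(code: int) -> int:
    """Shift a VQ code into the ID range reserved for sound tokens."""
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"code {code} outside codebook of size {CODEBOOK_SIZE}")
    return TEXT_VOCAB_SIZE + code

def early_fusion_sequence(prefix_ids: list[int], sound_codes: list[int],
                          suffix_ids: list[int]) -> list[int]:
    """One flat sequence mixing text token IDs and shifted sound token IDs."""
    sound_ids = [sound_code_to_token_id(c) for c in sound_codes]
    return [*prefix_ids, SOUND_START, *sound_ids, SOUND_END, *suffix_ids]

print(early_fusion_sequence([1, 2], [0, 511], [3]))
# [1, 2, 128768, 128256, 128767, 128769, 3]
print(EXTENDED_VOCAB_SIZE)  # 128770
```

The model's embedding and output layers then need `EXTENDED_VOCAB_SIZE` rows; in 🤗 Transformers this is typically done with `model.resize_token_embeddings(...)`.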

I'm super bullish on Homebrew/Jan and on early-fusion audio + text multimodal models!

(P.S. Play with the demo on Hugging Face: jan-hq/Ichigo-llama3.1-s-instruct)
reacted to m-ric's post with 👀🔥 2 months ago
Rhymes AI drops Aria: small Multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚡️

A new player has entered the game! Rhymes AI has just launched and unveiled Aria, a multimodal powerhouse that's punching above its weight.

Key insights:

🧠 Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.

🌈 Multimodal: text/image/video β†’ text.

πŸ“š Novel training approach: β€œmultimodal-native” where multimodal training starts directly during pre-training, not just tacked on later

πŸ“ Long 64K token context window

πŸ”“ Apache 2.0 license, with weights, code, and demos all open

⚡️ On the benchmark side, Aria leaves some big names in the dust.

- It beats Pixtral 12B and Llama-3.2-11B on several vision benchmarks like MMMU and MathVista.
- It even surpasses the much bigger GPT-4o on long-video tasks and outshines Gemini 1.5 Flash at parsing lengthy documents.

But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called "BeaGo", which handles even recent events with great accuracy!

And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.

Read their paper for Aria 👉 Aria: An Open Multimodal Native Mixture-of-Experts Model (2410.05993)

Try BeaGo 🐶 👉 https://rhymes.ai/blog-details/introducing-beago-your-smarter-faster-ai-search