Nishith Jain

KingNish

AI & ML interests

AI is fun actually.

KingNish's activity

Reacted to merve's post with 🔥 23 days ago
Another great week in open ML!
Here's a small recap 🫰🏻

Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long-video language model based on DINOv2, SigLIP, Qwen2 and Llama 3.2

💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a new family of smol language models with an Apache 2.0 license, available in 135M, 360M and 1.7B sizes, along with datasets.
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs in 125M, 350M and 600M sizes

🖼️ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with commercially permissive license

🖼️💬Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o so far, is released: a new LLM that takes image, text and audio input and outputs speech!

Dataset releases
🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples, with captions generated using Florence-2
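
As a quick, hedged sketch of trying one of these releases locally, here is how SmolLM2 could be loaded with transformers; the instruct repo ID HuggingFaceTB/SmolLM2-1.7B-Instruct and the chat-style pipeline input are assumptions, not from the post:

```python
# Minimal sketch: chat with SmolLM2 via transformers.
# Assumes HuggingFaceTB/SmolLM2-1.7B-Instruct is the instruct checkpoint
# and a recent transformers version with chat-aware text-generation pipelines.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain LoRA in one sentence."}]
out = generator(messages, max_new_tokens=64)[0]["generated_text"]
print(out[-1]["content"])  # the assistant's reply
```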
Reacted to prithivMLmods's post with 👍 23 days ago
New Drops 🥳

😶‍🌫️Collection: prithivMLmods/flux-lora-collections-66dd5908be2206cfaa8519be

🥳Demo here: prithivMLmods/FLUX-LoRA-DLC, with more than 100 Flux LoRAs

🪨Fluid Dramatic Neon: prithivMLmods/Castor-Dramatic-Neon-Flux-LoRA
🪨Past & Present Blend: prithivMLmods/Past-Present-Deep-Mix-Flux-LoRA
🪨Tarot Cards Refreshed Themes: prithivMLmods/Ton618-Tarot-Cards-Flux-LoRA
🪨Amxtoon Character Mix Real-Anime: prithivMLmods/Ton618-Amxtoon-Flux-LoRA
🪨Epic Realism Flux v1: prithivMLmods/Ton618-Epic-Realism-Flux-LoRA
🪨Mock-up Textures: prithivMLmods/Mockup-Texture-Flux-LoRA
@prithivMLmods 🤗
Reacted to thomwolf's post with 🚀 about 1 month ago
Is it time for the open-source AI robot revolution 🚀?

With @haixuantao and @Leyo we’ve been playing with a low-cost DJI robot controlled by three local open-source AI models (Whisper, Idefics2, Parler-TTS, all Apache 2.0) and orchestrated by dora-rs.

Links to find all the hardware/software we used in the demo:
- robot control framework – dora-rs: https://github.com/dora-rs/dora
- speech-to-text model – whisper: openai/whisper-base
- vision-text model – Idefics2: HuggingFaceM4/idefics2-8b-AWQ
- text-to-speech model – ParlerTTS mini: parler-tts/parler_tts_mini_v0.1
- robot: https://dji.com/robomaster-s1
- code gist: https://gist.github.com/haixuanTao/860e1740245dc2c8dd85b496150a9320
- Larger codebase: dora-rs/dora-idefics2
- laptop/pc: any with a recent GPU (ours has an RTX 4090)

Enjoy!
Reacted to singhsidhukuldeep's post with 👀 about 1 month ago
Good folks at @Apple have developed a novel method called KV Prediction that significantly reduces the "time to first token" (TTFT) for on-device LLM inference.

Some highlights of the paper:

• Uses a small auxiliary transformer model to efficiently predict the KV cache of a larger base model
• Reduces TTFT by up to 4x while retaining 60-80% accuracy on benchmarks
• Achieves Pareto-optimal efficiency-accuracy trade-off compared to baselines
• Demonstrates 15-50% relative accuracy improvements on TriviaQA at equal TTFT FLOP budgets
• Shows up to 30% accuracy gains on HumanEval code completion at fixed TTFT FLOP counts
• Validated on Apple M2 Pro CPU, proving FLOP gains translate to real-world speedups


So, how's it done?

Based on the KV Prediction method described in the paper, here are the key steps:

1. Choose a base model and an auxiliary model:
- The base model is a larger, pretrained transformer model that will be used for final generation.
- The auxiliary model is a smaller transformer model used to efficiently process the input prompt.

2. Design the KV predictor:
- Create a set of learned linear projections to map from the auxiliary model's KV cache to the base model's KV cache.
- Define a mapping from auxiliary cache layers to base cache layers.

3. Training process:
- Pass input tokens through the auxiliary model to get its KV cache.
- Use the KV predictor to generate a predicted KV cache for the base model.
- Run the base model using the predicted KV cache and compute losses.
- Backpropagate errors through the frozen base model to update the auxiliary model and KV predictor.

4. Inference process:
- Process the input prompt with the auxiliary model to get its KV cache.
- Use the KV predictor to generate the predicted base model KV cache.
- Run a single token generation step with the base model using the predicted KV cache.
- Continue autoregressive generation with the base model as normal.
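
To make the flow concrete, here's a toy PyTorch sketch of step 2's KV predictor; all dimensions, the layer mapping, and tensor layouts are illustrative assumptions, not the paper's exact configuration:

```python
# Toy sketch of KV Prediction: map a small auxiliary model's KV cache
# to a larger base model's KV cache with learned linear projections.
# Sizes and the 2-base-layers-per-aux-layer mapping are assumptions.
import torch
import torch.nn as nn

AUX_LAYERS, BASE_LAYERS = 4, 8
AUX_DIM, BASE_DIM = 256, 1024
SEQ_LEN = 16

class KVPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # One K-projection and one V-projection per base layer.
        self.k_proj = nn.ModuleList(nn.Linear(AUX_DIM, BASE_DIM) for _ in range(BASE_LAYERS))
        self.v_proj = nn.ModuleList(nn.Linear(AUX_DIM, BASE_DIM) for _ in range(BASE_LAYERS))
        # Map each base layer to an auxiliary layer.
        self.layer_map = [i * AUX_LAYERS // BASE_LAYERS for i in range(BASE_LAYERS)]

    def forward(self, aux_kv):
        # aux_kv: list of (K, V) pairs, each of shape (batch, seq, AUX_DIM)
        base_kv = []
        for i in range(BASE_LAYERS):
            k, v = aux_kv[self.layer_map[i]]
            base_kv.append((self.k_proj[i](k), self.v_proj[i](v)))
        return base_kv

aux_kv = [(torch.randn(1, SEQ_LEN, AUX_DIM), torch.randn(1, SEQ_LEN, AUX_DIM))
          for _ in range(AUX_LAYERS)]
predicted = KVPredictor()(aux_kv)
print(predicted[0][0].shape)  # torch.Size([1, 16, 1024])
```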

Excited to hear your thoughts!
Reacted to Pendrokar's post with 🔥 about 1 month ago
Made a notable change to the TTS Arena fork. Few people care which bottom-tier TTS model edges out the one next to it, so one of the top 5 TTS models is now always included in each challenge for extra scrutiny. The top 5 are taken from preliminary results.
Pendrokar/TTS-Spaces-Arena
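
For illustration, here is a minimal sketch of what that matchup sampling could look like; the model names and the leaderboard source are assumptions, not taken from the actual Space:

```python
# Hypothetical sketch of the matchup sampling: every challenge pairs
# one of the current top-5 TTS models (from preliminary results) with
# a random challenger from the rest of the pool.
import random

preliminary_ranking = ["tts_a", "tts_b", "tts_c", "tts_d", "tts_e",
                       "tts_f", "tts_g", "tts_h"]  # best-to-worst, assumed names

def sample_matchup(ranking, top_n=5):
    top_model = random.choice(ranking[:top_n])
    challenger = random.choice([m for m in ranking if m != top_model])
    return top_model, challenger

print(sample_matchup(preliminary_ranking))
```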
Reacted to victor's post with 🔥 about 1 month ago
Reacted to reach-vb's post with 🔥 about 1 month ago
Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥

> WhisperSpeech X Llama 3.1 8B
> Trained on 50K hours of speech (7 languages)
> Continually trained for 45 hrs on 10x A1000s
> MLS -> WhisperVQ tokens -> Llama 3.1
> Instruction tuned on 1.89M samples
> 70% speech, 20% transcription, 10% text
> Apache 2.0 licensed ⚡

Architecture:
> WhisperSpeech VQ for semantic tokens
> Llama 3.1 8B Instruct as the text backbone
> Early fusion (Chameleon-style)
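
For intuition, here is a rough sketch of what early fusion of audio and text tokens looks like; the vocabulary and codebook sizes below are illustrative assumptions, not Ichigo's actual configuration:

```python
# Rough sketch of early fusion: discrete audio tokens from a VQ codebook
# share one embedding table with text tokens, so the LLM consumes a single
# interleaved sequence. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000
AUDIO_CODES = 512          # WhisperVQ-style codebook size (assumed)
D_MODEL = 4096

# One embedding table covering text IDs [0, TEXT_VOCAB) and
# audio IDs [TEXT_VOCAB, TEXT_VOCAB + AUDIO_CODES).
embed = nn.Embedding(TEXT_VOCAB + AUDIO_CODES, D_MODEL)

text_ids = torch.tensor([101, 2054, 2003])          # hypothetical text tokens
audio_ids = torch.tensor([7, 42, 99]) + TEXT_VOCAB  # shift audio codes into vocab

sequence = torch.cat([audio_ids, text_ids])  # interleaved input
hidden = embed(sequence)                     # fed to the transformer backbone
print(hidden.shape)  # torch.Size([6, 4096])
```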

I'm super bullish on Homebrew/Jan and early-fusion audio-text multimodal models!

(P.S. Play with the demo on Hugging Face: jan-hq/Ichigo-llama3.1-s-instruct)
Reacted to mervenoyan's post with 🔥 about 1 month ago
Reacted to DeFactOfficial's post with 🚀 about 1 month ago
Ladies and Gents, please try my new Assistant, Image Gen - Uncensored Edition, on HuggingChat.

https://hf.co/chat/assistant/66fccce0c0fafc94ab557ef2

This is a multimodal assistant: Qwen 2.5 72B + SOTA diffusion models for image generation. Same architecture as Image Gen+ but with some MAJOR improvements! These are as follows:

- Switched the LLM to Qwen 2.5 72B, the most powerful model currently available on HuggingChat. This results in higher-quality prompts for the txt2img model and much better adherence to the prompt-in-URL format that the upstream provider requires (image-gen models are hosted by Pollinations, as with most other assistants on HuggingChat that offer image generation).

- Cleaned up the system prompt, including the examples of the prompt-in-URL format, and adjusted the logic that determines how many images to generate based on the quality of the user prompt... these changes further improve the results.

- The Assistant has access to multiple image-generation models and will by default choose whichever model is most appropriate for the task. This includes NSFW generations, which it makes using an uncensored SD3 Turbo. For other workloads, the Assistant preferentially uses one of the Flux variants or any-dark (an artistic SDXL finetune), based on the nature of the task. Available models include turbo, flux, flux-realism, flux-anime, flux-3d and any-dark.

- Added verbiage to the system prompt that greatly reduces censorship/refusals by the LLM (the txt2img models are uncensored to begin with)
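
To illustrate the prompt-in-URL format the assistant emits, here is a hedged sketch of how such a request URL could be built; the exact Pollinations endpoint shape and query parameters are assumptions, not confirmed by the post:

```python
# Hedged sketch: build a prompt-in-URL image request of the kind the
# assistant emits. The endpoint path and parameters are assumptions.
from urllib.parse import quote

def image_url(prompt: str, model: str = "flux",
              width: int = 1024, height: int = 1024) -> str:
    # URL-encode the prompt so it is safe to embed in the path.
    return (f"https://image.pollinations.ai/prompt/{quote(prompt)}"
            f"?model={model}&width={width}&height={height}")

print(image_url("the chien of andalous, in a psychedelic style",
                model="flux-anime"))
```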

Here are the user-entered prompts used to create the images you see here... feel free to try them yourself!

"Ayatollah Khameini and Kamala Harris having a secret romantic rendezvous. Use flux-realism model"
"A self portrait of your consciousness"
"The chien of andalous, in a psychedelic style"
"Make me 4 paintings in the style of Frida Kahlo that I can sell to tourists in a mexican hippie town"
"Paint me a van gogh and greg rutkowski style scene involving elephants and gerbils"
Reacted to victor's post with 🤗 about 2 months ago
NEW - Inference Playground

Maybe, like me, you have always wanted a super easy way to compare llama3.2-1B vs. llama3.2-3B, or the same model at different temperatures?

Trying and comparing warm Inference API models has never been easier!
Just go to https://hf.co/playground, set your token and you're ready to go.
We'll keep improving, feedback welcome 😊
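
For those who prefer code over the UI, a comparable check can be scripted against warm Inference API models with huggingface_hub; the model IDs and sampling parameters below are assumptions (and gated repos may require logging in with a token):

```python
# Minimal sketch: compare two warm Inference API models on one prompt.
# Model IDs and parameters are assumptions; gated models need an HF token.
from huggingface_hub import InferenceClient

prompt = [{"role": "user", "content": "Summarize attention in one sentence."}]

for model in ["meta-llama/Llama-3.2-1B-Instruct",
              "meta-llama/Llama-3.2-3B-Instruct"]:
    client = InferenceClient(model)
    out = client.chat_completion(prompt, max_tokens=64, temperature=0.7)
    print(model, "->", out.choices[0].message.content)
```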
Reacted to reach-vb's post with 🔥 about 2 months ago
NEW: open-source text/image-to-video model is out - MIT licensed - rivals Gen-3, Pika & Kling 🔥

> Pyramid Flow: Training-efficient Autoregressive Video Generation method
> Utilizes Flow Matching
> Trains on open-source datasets
> Generates high-quality 10-second videos
> Video resolution: 768p
> Frame rate: 24 FPS
> Supports image-to-video generation

> Model checkpoints available on the hub 🤗: rain1011/pyramid-flow-sd3
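
A minimal, hedged sketch for grabbing those checkpoints from the Hub; the local directory is arbitrary, and actually running inference through the authors' repo code is assumed, not shown:

```python
# Minimal sketch: fetch the Pyramid Flow checkpoints from the Hub.
# Inference itself goes through the authors' codebase (not shown here).
from huggingface_hub import snapshot_download

path = snapshot_download("rain1011/pyramid-flow-sd3",
                         local_dir="pyramid-flow-sd3")
print("Checkpoints downloaded to:", path)
```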
Reacted to m-ric's post with 🔥 about 2 months ago
Rhymes AI drops Aria: small Multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚡️

A new player has entered the game! Rhymes AI has just been announced, and unveiled Aria, a multimodal powerhouse that punches above its weight.

Key insights:

🧠 Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.

🌈 Multimodal: text/image/video → text.

📚 Novel training approach: “multimodal-native” where multimodal training starts directly during pre-training, not just tacked on later

📏 Long 64K token context window

🔓 Apache 2.0 license, with weights, code, and demos all open

⚡️ On the benchmark side, Aria leaves some big names in the dust.

- It beats Pixtral 12B and Llama-3.2-11B on several vision benchmarks like MMMU and MathVista.
- It even beats the much bigger GPT-4o on long-video tasks, and outshines Gemini 1.5 Flash at parsing lengthy documents.

But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented-search app called "BeaGo", and it handles even recent events with great accuracy!

And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.

Read their paper for Aria 👉  Aria: An Open Multimodal Native Mixture-of-Experts Model (2410.05993)

Try BeaGo 🐶 👉 https://rhymes.ai/blog-details/introducing-beago-your-smarter-faster-ai-search
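
If you'd rather poke at the weights than the demo, a hedged loading sketch with transformers might look like this; the rhymes-ai/Aria repo ID and remote-code loading are assumptions based on typical Hub conventions:

```python
# Hedged sketch: load Aria with transformers. The repo ID and the need
# for trust_remote_code are assumptions; only ~3.9B of the 25.3B params
# are active per token thanks to the MoE routing.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hub repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)
```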
Reacted to merve's post with 🔥 about 2 months ago
Meta AI vision has been cooking @facebook: they shipped multiple models and demos for their papers at ECCV 🤗

Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more; all models have open weights and demos 👏

All models have demos and even TorchScript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art model for consistent 3D generation from images

Model: facebook/vfusion3d
Demo: facebook/VFusion3D

- CoTracker is the state-of-the-art point (pixel) tracking model

Demo: facebook/cotracker
Model: facebook/cotracker
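
As a hedged usage sketch, CoTracker is commonly loaded via torch.hub; the "cotracker2" entry point and call signature below are assumptions based on the project's README, not confirmed by this post:

```python
# Hedged sketch: run CoTracker on a dummy clip via torch.hub.
# The entry point name and call signature are assumptions.
import torch

cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2")
video = torch.randn(1, 8, 3, 256, 256)  # (batch, frames, channels, H, W)
pred_tracks, pred_visibility = cotracker(video, grid_size=10)
print(pred_tracks.shape)  # per-point (x, y) tracks across frames
```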
Reacted to MonsterMMORPG's post with ❤️ about 2 months ago
Huge news for Kohya GUI: you can now fully fine-tune / DreamBooth FLUX Dev on GPUs with as little as 6 GB of VRAM, with no quality loss compared to 48 GB GPUs. Moreover, fine-tuning yields better results than any LoRA training could.

Config Files
I published all configs here: https://www.patreon.com/posts/112099700

Tutorials
Fine tuning tutorial in production

Windows FLUX LoRA training (fine-tuning is the same, just config changes): https://youtu.be/nySGu12Y05k

Cloud FLUX LoRA training (RunPod and Massed Compute ultra cheap) : https://youtu.be/-uhL2nW7Ddw

LoRA Extraction
The checkpoints are 23.8 GB, but you can extract a LoRA from them with almost no quality loss. I researched this and published a public article/guide as well.

The guide for extracting a LoRA from a fine-tuned checkpoint is here: https://www.patreon.com/posts/112335162
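
Conceptually, LoRA extraction factors the weight delta between the fine-tuned and base checkpoints with a truncated SVD. Here's a hedged single-matrix sketch; Kohya's extractor operates on whole checkpoints, and the rank and shapes below are toy values:

```python
# Conceptual sketch of LoRA extraction: approximate the delta between a
# fine-tuned and a base weight matrix with a rank-r factorization via SVD.
import torch

def extract_lora(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int = 16):
    delta = w_tuned - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    # Keep the top-`rank` singular directions: delta ≈ (u @ diag(s)) @ vh
    lora_up = u[:, :rank] * s[:rank]  # (out_dim, rank)
    lora_down = vh[:rank, :]          # (rank, in_dim)
    return lora_up, lora_down

w_base, w_tuned = torch.randn(1024, 1024), torch.randn(1024, 1024)
up, down = extract_lora(w_base, w_tuned)
print((up @ down).shape)  # low-rank approximation of the weight delta
```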

Info
This is just mind-blowing. The recent improvements Kohya made for block swapping are just amazing.

Speeds are also amazing, as you can see in image 2. Of course, those values are based on my researched config and were tested on an RTX A6000 (almost the same speed as an RTX 3090).

Also, all training experiments were done at 1024x1024 px. If you use a lower resolution, you'll use less VRAM and get faster speeds.

VRAM usage will change according to your own configuration, and likely speed as well.

Installers
Kohya GUI accurate branch and Windows Torch 2.5 installers, plus test prompts, are shared here: https://www.patreon.com/posts/110879657

The Kohya GUI accurate branch: https://github.com/bmaltais/kohya_ss/tree/sd3-flux.1
Reacted to clem's post with ❤️ about 2 months ago
Open-source AI creates healthy competition in a field where natural tendencies lead to extreme concentration of power. Imagine a world where only one or two companies could build software. This is the biggest risk and ethical challenge of them all IMO. Let's fight this!
Reacted to Tonic's post with 👀 about 2 months ago
Reacted to clem's post with 🚀❤️ about 2 months ago
Very few people realize that most of the successful AI startups got successful because they were focused on open science and open source for at least their first few years. To name but a few: OpenAI (GPT and GPT-2 were open-source), Runway & Stability (Stable Diffusion), Cohere, Mistral and of course Hugging Face!

The reasons are not just altruistic: sharing your science and your models pushes you to build AI faster (key in a fast-moving domain like AI), attracts the best scientists and engineers, and generates much more visibility, usage and community contributions than staying 100% closed-source. The same applies to big tech companies, as we're seeing with Meta and Google!

More startups and companies should release research and open-source AI; it's not just good for the world, it also increases their probability of success!
Reacted to awacke1's post with 🔥 about 2 months ago
Updated my 📺RTV🖼️ - Real Time Video AI app this morning.
URL: awacke1/stable-video-diffusion

It uses Stable Video Diffusion to dynamically create videos from images in the input directory, or from uploaded images, using an A10 GPU on Hugging Face.


Samples below.

I may transition this to ZeroGPU if I can. During Christmas, when I revised this, I had my highest HF bill yet due to GPU usage. It is still the best turnkey GPU app out there, and Image2Video is a killer app. Thanks HF for the possibilities!
Reacted to Tonic's post with 🔥 about 2 months ago