First project of 2025: Vision Transformer Explorer
I built a web app to interactively explore the self-attention maps produced by ViTs. It shows what the model is focusing on when making predictions and offers insight into its inner workings! 🤯
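Not the app's actual code, but a minimal Python sketch of the underlying technique, using the stock google/vit-base-patch16-224 checkpoint (an assumption; the app may use a different model) to pull per-layer attention maps out of a ViT:

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load a stock ViT; "eager" attention is needed so attention weights are returned.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", attn_implementation="eager"
)

image = Image.open("input.jpg").convert("RGB")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, heads, tokens, tokens); 197 tokens = 1 [CLS] + 14*14 patches.
last_layer = outputs.attentions[-1]
cls_to_patches = last_layer[0, :, 0, 1:]                 # how [CLS] attends to each patch
attention_map = cls_to_patches.mean(0).reshape(14, 14)   # average over heads

print(model.config.id2label[outputs.logits.argmax(-1).item()])
```

Upsampling that 14x14 grid back to the input resolution and overlaying it on the image gives the familiar attention heatmap.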
QvQ-72B-Preview 🎄 an open-weight model for visual reasoning just released by the Alibaba_Qwen team: Qwen/qvq-676448c820912236342b9888
✨ Combines visual understanding & language reasoning
✨ Scores 70.3 on MMMU
✨ Outperforms Qwen2-VL-72B-Instruct in complex problem-solving
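For reference, a hedged sketch of trying it with transformers, assuming the checkpoint id Qwen/QVQ-72B-Preview and that QvQ follows the Qwen2-VL chat interface (the 72B weights realistically need multiple large GPUs):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/QVQ-72B-Preview"  # assumed repo id for the preview weights
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("puzzle.png").convert("RGB")  # hypothetical input image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Think step by step: what is the answer to this puzzle?"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)[0])
```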
* 4 new video models
* Multiple image models, including SANA & Flux Control
* New quantizers -> GGUF & TorchAO
* New training scripts
Enjoy this holiday-special Diffusers release 🤗
Notes: https://github.com/huggingface/diffusers/releases/tag/v0.32.0
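A quick sketch of the new GGUF quantizer path, adapted from the release docs; the city96/FLUX.1-dev-gguf checkpoint URL is an example and the exact file name may differ:

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Example community GGUF export of the FLUX.1-dev transformer (Q2_K quantization).
ckpt_url = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_url,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keeps VRAM use manageable
pipe("a cozy cabin in the snow, watercolor").images[0].save("out.png")
```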
A new experimental model that unlocks stronger reasoning capabilities and shows its thoughts: it plans with its thoughts visible, solves complex problems at Flash speeds, and more.
Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js
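The app itself runs on Transformers.js in the browser; to keep the sketches in this post in one language, here is a rough server-side equivalent using the transformers ASR pipeline, assuming a transformers version with Moonshine support and the UsefulSensors/moonshine-tiny checkpoint:

```python
from transformers import pipeline

# Assumptions: Moonshine support in the installed transformers version and
# the UsefulSensors/moonshine-tiny checkpoint; the web app instead runs the
# ONNX export via Transformers.js on WebGPU (with a WASM fallback).
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")
result = asr("recording.wav")  # hypothetical 16 kHz mono audio file
print(result["text"])
```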