The first open model with a Stable Diffusion 3-like architecture is JUST out 💣 - but it is not SD3! 🤔
It is Tencent-Hunyuan/HunyuanDiT by Tencent, a 1.5B-parameter DiT (diffusion transformer) text-to-image model 🖼️✨, trained with a multi-lingual CLIP + multi-lingual T5 as text encoders for English 🤝 Chinese understanding
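Not a verified snippet, just a minimal sketch of what bilingual inference could look like if/when a diffusers-format checkpoint is available (the repo id and pipeline resolution are assumptions, not confirmed here):

```python
# Minimal inference sketch, assuming the release ships a diffusers-compatible
# checkpoint (exact repo id below is an assumption).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",  # assumed repo id for a diffusers export
    torch_dtype=torch.float16,
).to("cuda")

# Thanks to the multilingual CLIP + T5 encoders, the same checkpoint should
# accept both English and Chinese prompts.
image_en = pipe("a cute cat wearing a tiny straw hat, watercolor").images[0]
image_zh = pipe("一只戴着小草帽的可爱猫咪，水彩画").images[0]

image_en.save("cat_en.png")
image_zh.save("cat_zh.png")
```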
The Stable Diffusion 3 research paper broken down, including some overlooked details! 📝
Model
📏 2 base model variants mentioned: 2B and 8B sizes
📐 New architecture at all abstraction levels:
- 🔽 UNet out; ⬆️ Multimodal Diffusion Transformer in, bye cross-attention 👋
- 🆕 Rectified flows for the diffusion process (see the sketch after this list)
- 🧩 Still a Latent Diffusion Model
📄 3 text encoders: 2 CLIPs, one T5-XXL; plug-and-play: removing the largest one (the T5) at inference still keeps the model competitive
🗃️ Dataset was deduplicated with SSCD, which helped reduce memorization (no further details about the dataset, though)
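For the rectified-flow bullet above: a minimal, self-contained sketch of the training objective, with a toy MLP standing in for the MM-DiT and toy "latents" (network, shapes and generation hyper-parameters are illustrative assumptions):

```python
# Rectified-flow / flow-matching objective: interpolate linearly between data
# and noise, and regress the constant velocity along that straight path.
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Stand-in for the MM-DiT: predicts the velocity field v(x_t, t)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def rectified_flow_loss(model, latents):
    noise = torch.randn_like(latents)
    # Logit-normal timestep sampling (sigmoid of a standard normal),
    # one of the schedules discussed in the paper.
    t = torch.sigmoid(torch.randn(latents.shape[0], device=latents.device))
    # Straight-line path from data (t=0) to noise (t=1).
    x_t = (1.0 - t[:, None]) * latents + t[:, None] * noise
    target_velocity = noise - latents   # d x_t / d t is constant along the line
    pred_velocity = model(x_t, t)
    return torch.mean((pred_velocity - target_velocity) ** 2)

model = ToyVelocityNet()
latents = torch.randn(8, 16)            # pretend VAE latents (flattened toy shape)
loss = rectified_flow_loss(model, latents)
loss.backward()
```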
Variants
🔁 A DPO fine-tuned model showed clear improvements in prompt understanding and aesthetics
✏️ An Instruct Edit 2B model was trained, and learned how to do text replacement
Results
✅ State of the art in automated evals for composition and prompt understanding
✅ Best win rate in human preference evaluation for prompt understanding, aesthetics, and typography (missing some details on the number of participants and the design of the experiment)