The first open Stable Diffusion 3-like architecture model is JUST out, but it is not SD3!
It is Tencent-Hunyuan/HunyuanDiT by Tencent, a 1.5B parameter DiT (diffusion transformer) text-to-image model, trained with multilingual CLIP + multilingual T5 text encoders for English and Chinese understanding
- 3 text encoders: 2 CLIPs, one T5-XXL; plug-and-play: removing the larger one maintains competitiveness
- Dataset was deduplicated with SSCD, which helped reduce memorization (no more details about the dataset tho)
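For anyone wondering what embedding-based deduplication roughly looks like, here is a minimal sketch. It is not the authors' pipeline: SSCD is a copy-detection descriptor model, and the toy vectors below stand in for its embeddings. The function names, the greedy strategy and the `threshold` value are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dedup_by_embedding(embeddings, threshold=0.9):
    """Greedy near-duplicate filter (hypothetical helper).

    Keeps an item only if its similarity to every already-kept item is
    below `threshold`. Returns the indices of the kept items.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 2-D "embeddings"; real copy-detection descriptors are much larger.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(dedup_by_embedding(toy))  # → [0, 2]
```

The greedy loop compares every item against all kept items, so it is quadratic in the worst case; at dataset scale one would more likely lean on approximate nearest-neighbor search or clustering.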
Variants:
- A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics
- An Instruct Edit 2B model was trained, and learned how to do text replacement
Results:
- State of the art in automated evals for composition and prompt understanding
- Best win rate in human preference evaluation for prompt understanding, aesthetics and typography (some details are missing on the number of participants and the design of the experiment)