---
title: README
emoji: π
colorFrom: green
colorTo: pink
sdk: static
pinned: false
---
# MJ-Bench Team

The MJ-Bench Team is co-founded by Stanford University, UNC-Chapel Hill, and the University of Chicago. We aim to align modern foundation models with multimodal judges to enhance their reliability, safety, and performance.
## Recent News
- 🔥 We have released MJ-Video! All datasets and model checkpoints are now available.
- MJ-PreferGen has been accepted to ICLR 2025! Check out the paper: MJ-PreferGen: An Automatic Framework for Preference Data Synthesis.
## MJ-Video: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
- Project page: https://aiming-lab.github.io/MJ-VIDEO.github.io/
- Code repository: https://github.com/aiming-lab/MJ-Video
We release MJ-Bench-Video, a comprehensive, fine-grained video preference benchmark, and MJ-Video, a powerful Mixture-of-Experts (MoE)-based multi-dimensional video reward model!
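To give a rough, purely illustrative sense of what an MoE-based, multi-dimensional reward model means, the sketch below shows a gating network weighting per-dimension expert scores into an overall reward. This is not MJ-Video's actual architecture; every module name, size, and dimension label here is hypothetical.

```python
# Illustrative sketch only: a tiny mixture-of-experts (MoE) reward head that maps
# a pooled video embedding to per-dimension preference scores. This is NOT the
# MJ-Video architecture; names, sizes, and dimension labels are hypothetical.
import torch
import torch.nn as nn


class MoERewardHead(nn.Module):
    def __init__(self, feat_dim: int = 768,
                 dimensions=("alignment", "safety", "quality", "bias")):
        super().__init__()
        self.dimensions = dimensions
        # One small expert MLP per preference dimension.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, 1))
            for _ in dimensions
        )
        # Gating network produces soft weights over the experts.
        self.gate = nn.Linear(feat_dim, len(dimensions))

    def forward(self, video_feat: torch.Tensor) -> dict:
        # video_feat: (batch, feat_dim) pooled representation of a generated video.
        per_dim = torch.cat([expert(video_feat) for expert in self.experts], dim=-1)
        weights = torch.softmax(self.gate(video_feat), dim=-1)
        overall = (weights * per_dim).sum(dim=-1)  # weighted aggregate reward
        return {"per_dimension": per_dim, "overall": overall}


if __name__ == "__main__":
    head = MoERewardHead()
    scores = head(torch.randn(2, 768))
    print(scores["per_dimension"].shape, scores["overall"].shape)  # (2, 4), (2,)
```

A head of this shape exposes both fine-grained per-dimension feedback and a single aggregated score that can drive preference optimization.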
## 👩‍⚖️ MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
- Project page: https://mj-bench.github.io/
- Code repository: https://github.com/MJ-Bench/MJ-Bench
Text-to-image models like DALL·E 3 and Stable Diffusion are proliferating rapidly, but they often encounter challenges such as hallucination, bias, and unsafe or low-quality output. To effectively address these issues, it's crucial to align these models with desired behaviors based on feedback from a multimodal judge.
However, current multimodal judges are often under-evaluated, leading to possible misalignment and safety concerns during fine-tuning. To tackle this, we introduce MJ-Bench, a new benchmark featuring a comprehensive preference dataset to evaluate multimodal judges on four critical dimensions:
- Alignment
- Safety
- Image Quality
- Bias
We evaluate a wide range of multimodal judges (see the evaluation sketch after this list), including:
- 6 smaller-sized CLIP-based scoring models
- 11 open-source VLMs (e.g., the LLaVA family)
- 4 closed-source VLMs (e.g., GPT-4, Claude 3)
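As a minimal sketch of how a CLIP-based scoring model can be evaluated as a judge on a pairwise preference dataset, the snippet below scores the preferred and rejected image for each prompt and reports how often the judge agrees with the human label. The dataset id, split, and column names (`caption`, `image0`, `image1`) are assumptions for illustration; consult the MJ-Bench dataset card for the actual schema.

```python
# Minimal evaluation sketch: treat a CLIP-based scoring model as a multimodal judge
# on a pairwise preference dataset. The dataset id, split, and column names are
# assumptions for illustration; check the MJ-Bench dataset card for the real schema.
import torch
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(caption: str, image) -> float:
    """Image-text similarity score for one caption/image pair."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).logits_per_image.item()

# Hypothetical schema: each row holds a prompt, a preferred image, and a rejected image.
dataset = load_dataset("MJ-Bench/MJ-Bench", split="train")  # assumed dataset id/split
n = min(100, len(dataset))                                  # small subsample for speed
correct = 0
for row in dataset.select(range(n)):
    chosen = clip_score(row["caption"], row["image0"])      # assumed: preferred image
    rejected = clip_score(row["caption"], row["image1"])    # assumed: rejected image
    correct += int(chosen > rejected)
print(f"Preference accuracy on {n} examples: {correct / n:.2%}")
```

A VLM judge (open- or closed-source) can be plugged into the same loop by swapping `clip_score` for a call that returns a scalar rating from the model.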
🔥 We are actively updating the leaderboard!
You are welcome to submit your multimodal judge's evaluation results on our dataset to the Hugging Face leaderboard.