VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
Abstract
Recent image-to-video generation methods have demonstrated success in enabling control over one or two visual elements, such as camera trajectory or object motion. However, these methods are unable to offer control over multiple visual elements due to limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables control over camera motion, object motion, and lighting direction simultaneously. To better decouple control over each visual element, we propose the Spatial Triple-Attention Transformer, which integrates lighting direction, text, and image in a symmetric way. Since most real-world video datasets lack lighting annotations, we construct a high-quality synthetic video dataset, the VideoLightingDirection (VLD) dataset. This dataset includes lighting direction annotations and objects of diverse appearance, enabling VidCRAFT3 to effectively handle strong light transmission and reflection effects. Additionally, we propose a three-stage training strategy that eliminates the need for training data annotated with multiple visual elements (camera motion, object motion, and lighting direction) simultaneously. Extensive experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content, surpassing existing state-of-the-art methods in terms of control granularity and visual coherence. All code and data will be publicly available. Project page: https://sixiaozheng.github.io/VidCRAFT3/.
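To make the abstract's description of the Spatial Triple-Attention Transformer more concrete, below is a minimal PyTorch-style sketch of how a block might attend symmetrically over image, text, and lighting-direction tokens. This is only an assumption based on the abstract: the class name, layer layout, and residual fusion scheme are illustrative and may not match the actual VidCRAFT3 architecture.

```python
import torch
import torch.nn as nn

class SpatialTripleAttentionBlock(nn.Module):
    """Hypothetical sketch: three parallel cross-attention branches over
    image, text, and lighting-direction tokens, fused symmetrically.
    Names and design choices are illustrative, not the paper's code."""

    def __init__(self, dim: int = 320, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # One cross-attention branch per conditioning signal.
        self.attn_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_light = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, image_tokens, text_tokens, light_tokens):
        # x: (B, N, dim) latent video tokens at one spatial layer.
        h = self.norm(x)
        out_img, _ = self.attn_image(h, image_tokens, image_tokens)
        out_txt, _ = self.attn_text(h, text_tokens, text_tokens)
        out_lit, _ = self.attn_light(h, light_tokens, light_tokens)
        # Symmetric fusion: each branch contributes equally via a residual sum.
        return x + out_img + out_txt + out_lit

if __name__ == "__main__":
    block = SpatialTripleAttentionBlock()
    x = torch.randn(2, 64, 320)           # latent video tokens
    img = torch.randn(2, 77, 320)          # image-conditioning tokens
    txt = torch.randn(2, 77, 320)          # text tokens
    lit = torch.randn(2, 1, 320)           # lighting-direction embedding
    print(block(x, img, txt, lit).shape)   # torch.Size([2, 64, 320])
```

Under this sketch, decoupled control falls out of giving each conditioning signal its own attention branch, so one signal can be trained or dropped without touching the others; the paper's three-stage training strategy presumably relies on a similar separation.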
Community
The project page link is not working (404).
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent (2025)
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models (2025)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation (2025)
- VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024)
- FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors (2025)
- Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions (2025)
- Large Motion Video Autoencoding with Cross-modal Video VAE (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend