---
license: apache-2.0
language:
- en
pipeline_tag: image-to-video
library_name: diffusers
---
<p align="center">
<img src="./assets/banner.gif">
</p>
<p align="center">
  <a href="https://rhymes.ai/allegro_gallery" target="_blank">Gallery</a> ·
  <a href="https://github.com/rhymes-ai/Allegro" target="_blank">GitHub</a> ·
  <a href="https://rhymes.ai/blog-details/allegro-advanced-video-generation-model" target="_blank">Blog</a> ·
  <a href="https://arxiv.org/abs/2410.15458" target="_blank">Paper</a> ·
  <a href="https://discord.com/invite/u8HxU23myj" target="_blank">Discord</a>
</p>

# Gallery
<img src="./assets/TI2V_gallery.gif" width="1000" height="800"/>

For more demos and corresponding prompts, see the [Allegro Gallery](https://rhymes.ai/allegro_gallery).


# Key Features

- **Open Source**: Full [model weights](https://huggingface.co/rhymes-ai/Allegro-TI2V) and [code](https://github.com/rhymes-ai/Allegro) are available to the community under the Apache 2.0 license!
- **Versatile Content Creation**: Capable of generating a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
- **Text-Image-to-Video Generation**: Generate videos from user-provided prompts and images. Supported input types include:
  - Generating subsequent video content from a user prompt and first frame image.
  - Generating intermediate video content from a user prompt and both first and last frame images.
- **High-Quality Output**: Generate detailed 6-second videos at 15 FPS with 720x1280 resolution, which can be interpolated to 30 FPS with [EMA-VFI](https://github.com/MCG-NJU/EMAVFI).
- **Small and Efficient**: Features a 175M parameter VideoVAE and 2.8B parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB GPU memory in BF16 mode with CPU offloading. Context length is 79.2K, equivalent to 88 frames.


# Model info 
<table>
  <tr>
    <th>Model</th>
    <td>Allegro-TI2V</td>
    <td>Allegro</td>
  </tr>
  <tr>
    <th>Description</th>
    <td>Text-Image-to-Video Generation Model</td>
    <td>Text-to-Video Generation Model</td>
  </tr>
 <tr>
    <th>Download</th>
    <td><a href="https://huggingface.co/rhymes-ai/Allegro-TI2V">Hugging Face</a></td>
    <td><a href="https://huggingface.co/rhymes-ai/Allegro">Hugging Face</a></td>
</tr>
  <tr>
    <th rowspan="2">Parameter</th>
    <td colspan="2">VAE: 175M</td>
  </tr>
  <tr>
    <td colspan="2">DiT: 2.8B</td>
  </tr>
  <tr>
    <th rowspan="2">Inference Precision</th>
    <td colspan="2">VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)</td>
  </tr>
  <tr>
    <td colspan="2">DiT/T5: BF16/FP32/TF32</td>
  </tr>
  <tr>
    <th>Context Length</th>
    <td colspan="2">79.2K</td>
  </tr>
  <tr>
    <th>Resolution</th>
    <td colspan="2">720 x 1280</td>
  </tr>
  <tr>
    <th>Frames</th>
    <td colspan="2">88</td>
  </tr>
  <tr>
    <th>Video Length</th>
    <td colspan="2">6 seconds @ 15 FPS</td>
  </tr>
  <tr>
    <th>Single GPU Memory Usage</th>
    <td colspan="2">9.3 GB (BF16, with cpu_offload)</td>
  </tr>
    <tr>
    <th>Inference time</th>
    <td colspan="2">20 mins (single H100) / 3 mins (8xH100)</td>
  </tr>
</table>

# Quick start

1. **Download the [Allegro GitHub code](https://github.com/rhymes-ai/Allegro).**

2. **Install the necessary requirements.**
   - Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4. For details, see [requirements.txt](https://github.com/rhymes-ai/Allegro/blob/main/requirements.txt).  
   - It is recommended to use Anaconda to create a new environment (Python >= 3.10) to run the following example.  
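
   A minimal environment setup might look like the following (assumes Anaconda is installed and you are in the cloned repo root; the environment name `allegro` is arbitrary):
   ```bash
   # Create and activate a fresh environment, then install the repo's dependencies
   conda create -n allegro python=3.10 -y
   conda activate allegro
   pip install -r requirements.txt
   ```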

3. **Download the [Allegro-TI2V model weights](https://huggingface.co/rhymes-ai/Allegro-TI2V).**

4. **Run inference.**
   ```bash
   python single_inference_ti2v.py \
   --user_prompt 'The car drives along the road.' \
   --first_frame your/path/to/first_frame_image.png \
   --vae your/path/to/vae \
   --dit your/path/to/transformer \
   --text_encoder your/path/to/text_encoder \
   --tokenizer your/path/to/tokenizer \
   --guidance_scale 8 \
   --num_sampling_steps 100 \
   --seed 1427329220
   ```

   The output video resolution is fixed at 720 × 1280. Input images with different resolutions will be automatically cropped and resized to fit.
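
   The repo's own preprocessing code is authoritative, but to illustrate what an aspect-preserving center crop to a 720 × 1280 target looks like, here is a sketch in integer arithmetic (the function name and exact behavior are illustrative, not the actual implementation):
   ```python
   def center_crop_box(w, h, target_w=1280, target_h=720):
       """Return a (left, top, right, bottom) crop box matching the target aspect ratio."""
       if w * target_h > h * target_w:
           # Input is wider than the target aspect ratio: crop the width.
           new_w = h * target_w // target_h
           left = (w - new_w) // 2
           return (left, 0, left + new_w, h)
       # Input is taller than (or equal to) the target aspect ratio: crop the height.
       new_h = w * target_h // target_w
       top = (h - new_h) // 2
       return (0, top, w, top + new_h)

   print(center_crop_box(1600, 1200))  # (0, 150, 1600, 1050)
   ```
   After cropping, the image would be resized to 1280 × 720 before being fed to the model.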

    | Argument             | Description                                                                                       |
    |----------------------|---------------------------------------------------------------------------------------------------|
    | `--user_prompt`      | [Required] Text input for image-to-video generation.                                              |
    | `--first_frame`      | [Required] First-frame image input for image-to-video generation.                                  |
    | `--last_frame`       | [Optional] If provided, the model will generate intermediate video content based on the specified first and last frame images. |
    | `--enable_cpu_offload` | [Optional] Offload the model into CPU for less GPU memory cost (about 9.3G, compared to 27.5G if CPU offload is not enabled), but the inference time will increase significantly. |
5. **(Optional) Interpolate the video to 30 FPS**

- It is recommended to use [EMA-VFI](https://github.com/MCG-NJU/EMAVFI) to interpolate the video from 15 FPS to 30 FPS.  
- For better visual quality, you can use imageio to save the video.
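
EMA-VFI is a learned interpolator, but the effect on frame counts is easy to see with a naive midpoint-blend sketch (this is *not* the EMA-VFI algorithm; `frames` here is a hypothetical list of H×W×3 uint8 arrays standing in for decoded video frames):
```python
import numpy as np

def double_fps(frames):
    """Insert a blended midpoint frame between each pair, doubling the frame rate.

    A learned interpolator like EMA-VFI produces far better midpoints;
    simple averaging is used here only to show the frame bookkeeping.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        mid = ((a.astype(np.float32) + b.astype(np.float32)) / 2).astype(a.dtype)
        out.append(mid)
    out.append(frames[-1])
    return out

# 88 frames at 15 FPS (small 8x8 frames for illustration)
frames = [np.full((8, 8, 3), i, dtype=np.uint8) for i in range(88)]
doubled = double_fps(frames)
print(len(doubled))  # 175 frames, i.e. 2n - 1
```
Interpolating all 88 frames this way yields 175 frames, which play back for roughly the same 6 seconds at ~30 FPS.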



# License
This repo is released under the Apache 2.0 License.