RhymesAI committed
Commit 6cad2b3 (verified)
Parent: 834c6e9

Update README.md

Files changed (1): README.md (+69 -65)
README.md CHANGED
--- README.md (before)
@@ -6,7 +6,7 @@ pipeline_tag: image-to-video
 library_name: diffusers
 ---
 <p align="center">
- <img src="https://github.com/rhymes-ai/Allegro/blob/main/assets/TI2V_banner.gif">
 </p>
 <p align="center">
 <a href="https://rhymes.ai/allegro_gallery" target="_blank"> Gallery</a> · <a href="https://github.com/rhymes-ai/Allegro" target="_blank">GitHub</a> · <a href="https://rhymes.ai/blog-details/allegro-advanced-video-generation-model" target="_blank">Blog</a> · <a href="https://arxiv.org/abs/2410.15458" target="_blank">Paper</a> · <a href="https://discord.com/invite/u8HxU23myj" target="_blank">Discord</a> · <a href="https://docs.google.com/forms/d/e/1FAIpQLSfq4Ez48jqZ7ncI7i4GuL7UyCrltfdtrOCDnm_duXxlvh5YmQ/viewform" target="_blank">Join Waitlist</a> (Try it on Discord!)
@@ -14,118 +14,122 @@ library_name: diffusers
 </p>
 
 # Gallery
- <img src="https://huggingface.co/rhymes-ai/Allegro-TI2V/blob/main/assets/TI2V_gallery.gif" width="1000" height="800"/>For more demos and corresponding prompts, see the [Allegro Gallery](https://rhymes.ai/allegro_gallery).
 
 
 # Key Features
 
 - **Open Source**: Full [model weights](https://huggingface.co/rhymes-ai/Allegro) and [code](https://github.com/rhymes-ai/Allegro) available to the community, Apache 2.0!
 - **Versatile Content Creation**: Capable of generating a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
- - **High-Quality Output**: Generate detailed 6-second videos at 15 FPS with 720x1280 resolution, which can be interpolated to 30 FPS with [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI).
- - **Small and Efficient**: Features a 175M parameter VideoVAE and a 2.8B parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. Context length is 79.2K, equivalent to 88 frames.
 
- # Model info
 
 <table>
 <tr>
 <th>Model</th>
 <td>Allegro</td>
 </tr>
 <tr>
 <th>Description</th>
 <td>Text-to-Video Generation Model</td>
 </tr>
- <tr>
 <th>Download</th>
 <td><a href="https://huggingface.co/rhymes-ai/Allegro">Hugging Face</a></td>
- </tr>
 <tr>
 <th rowspan="2">Parameter</th>
- <td>VAE: 175M</td>
 </tr>
 <tr>
- <td>DiT: 2.8B</td>
 </tr>
 <tr>
 <th rowspan="2">Inference Precision</th>
- <td>VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)</td>
 </tr>
 <tr>
- <td>DiT/T5: BF16/FP32/TF32</td>
 </tr>
 <tr>
 <th>Context Length</th>
- <td>79.2K</td>
 </tr>
 <tr>
 <th>Resolution</th>
- <td>720 x 1280</td>
 </tr>
 <tr>
 <th>Frames</th>
- <td>88</td>
 </tr>
 <tr>
 <th>Video Length</th>
- <td>6 seconds @ 15 FPS</td>
 </tr>
 <tr>
 <th>Single GPU Memory Usage</th>
- <td>9.3G BF16 (with cpu_offload)</td>
 </tr>
 </table>
 
-
 # Quick start
 
- 1. Install the necessary requirements.
-
-    - Ensure Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4.
-    - It is recommended to use Anaconda to create a new environment (Python >= 3.10), e.g. `conda create -n rllegro python=3.10 -y`, to run the following example.
-    - Run `pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4`
-
- 2. Run inference.
- ```python
- import torch
- from diffusers import AutoencoderKLAllegro, AllegroPipeline
- from diffusers.utils import export_to_video
-
- # Load the VAE in FP32 (where it performs best) and the rest of the pipeline in BF16.
- vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
- pipe = AllegroPipeline.from_pretrained(
-     "rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16
- )
- pipe.to("cuda")
- pipe.vae.enable_tiling()
-
- prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."
-
- positive_prompt = """
- (masterpiece), (best quality), (ultra-detailed), (unwatermarked),
- {}
- emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo,
- sharp focus, high budget, cinemascope, moody, epic, gorgeous
- """
-
- negative_prompt = """
- nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality,
- low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
- """
-
- # Wrap the user prompt in the positive-prompt template.
- prompt = positive_prompt.format(prompt.lower().strip())
-
- video = pipe(prompt, negative_prompt=negative_prompt, guidance_scale=7.5, max_sequence_length=512, num_inference_steps=100, generator=torch.Generator(device="cuda:0").manual_seed(42)).frames[0]
- export_to_video(video, "output.mp4", fps=15)
- ```
-
-    Use `pipe.enable_sequential_cpu_offload()` to offload the model to the CPU for a lower GPU memory cost (about 9.3 GB, compared to 27.5 GB without CPU offload); inference time will increase significantly.
-
- 3. (Optional) Interpolate the video to 30 FPS.
-
-    It is recommended to use [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI) to interpolate the video from 15 FPS to 30 FPS.
-
-    For better visual quality, please use imageio to save the video.
-
- 4. For faster inference options such as Context Parallel and PAB, please refer to our [GitHub repo](https://github.com/rhymes-ai/Allegro).
 
 # License
 This repo is released under the Apache 2.0 License.
 
+++ README.md (after)
 library_name: diffusers
 ---
 <p align="center">
+ <img src="./assets/banner.gif">
 </p>
 <p align="center">
 <a href="https://rhymes.ai/allegro_gallery" target="_blank"> Gallery</a> · <a href="https://github.com/rhymes-ai/Allegro" target="_blank">GitHub</a> · <a href="https://rhymes.ai/blog-details/allegro-advanced-video-generation-model" target="_blank">Blog</a> · <a href="https://arxiv.org/abs/2410.15458" target="_blank">Paper</a> · <a href="https://discord.com/invite/u8HxU23myj" target="_blank">Discord</a> · <a href="https://docs.google.com/forms/d/e/1FAIpQLSfq4Ez48jqZ7ncI7i4GuL7UyCrltfdtrOCDnm_duXxlvh5YmQ/viewform" target="_blank">Join Waitlist</a> (Try it on Discord!)
 </p>
 
 # Gallery
+ <img src="./assets/TI2V_gallery.gif" width="1000" height="800"/>For more demos and corresponding prompts, see the [Allegro Gallery](https://rhymes.ai/allegro_gallery).
 
 
 # Key Features
 
 - **Open Source**: Full [model weights](https://huggingface.co/rhymes-ai/Allegro) and [code](https://github.com/rhymes-ai/Allegro) available to the community, Apache 2.0!
 - **Versatile Content Creation**: Capable of generating a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
+ - **Text-Image-to-Video Generation**: Generate videos from user-provided prompts and images. Supported input types include:
+   - Generating subsequent video content from a user prompt and a first-frame image.
+   - Generating intermediate video content from a user prompt and both first- and last-frame images.
+ - **High-Quality Output**: Generate detailed 6-second videos at 15 FPS with 720x1280 resolution, which can be interpolated to 30 FPS with [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI).
+ - **Small and Efficient**: Features a 175M parameter VideoVAE and a 2.8B parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. Context length is 79.2K, equivalent to 88 frames.
 
+ # Model info
 <table>
 <tr>
 <th>Model</th>
+ <td>Allegro-TI2V</td>
 <td>Allegro</td>
 </tr>
 <tr>
 <th>Description</th>
+ <td>Text-Image-to-Video Generation Model</td>
 <td>Text-to-Video Generation Model</td>
 </tr>
+ <tr>
 <th>Download</th>
+ <td><a href="https://huggingface.co/rhymes-ai/Allegro-TI2V">Hugging Face</a></td>
 <td><a href="https://huggingface.co/rhymes-ai/Allegro">Hugging Face</a></td>
+ </tr>
 <tr>
 <th rowspan="2">Parameter</th>
+ <td colspan="2">VAE: 175M</td>
 </tr>
 <tr>
+ <td colspan="2">DiT: 2.8B</td>
 </tr>
 <tr>
 <th rowspan="2">Inference Precision</th>
+ <td colspan="2">VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)</td>
 </tr>
 <tr>
+ <td colspan="2">DiT/T5: BF16/FP32/TF32</td>
 </tr>
 <tr>
 <th>Context Length</th>
+ <td colspan="2">79.2K</td>
 </tr>
 <tr>
 <th>Resolution</th>
+ <td colspan="2">720 x 1280</td>
 </tr>
 <tr>
 <th>Frames</th>
+ <td colspan="2">88</td>
 </tr>
 <tr>
 <th>Video Length</th>
+ <td colspan="2">6 seconds @ 15 FPS</td>
 </tr>
 <tr>
 <th>Single GPU Memory Usage</th>
+ <td colspan="2">9.3G BF16 (with cpu_offload)</td>
+ </tr>
+ <tr>
+ <th>Inference Time</th>
+ <td colspan="2">20 mins (single H100) / 3 mins (8xH100)</td>
 </tr>
 </table>
 
 # Quick start
 
+ 1. **Download the Allegro GitHub code.**
+
+ 2. **Install the necessary requirements.**
+    1. Ensure the following dependencies are met:
+       - Python >= 3.10
+       - PyTorch >= 2.4
+       - CUDA >= 12.4
+       For details, see `requirements.txt`.
+    2. It is recommended to use Anaconda to create a new environment (Python >= 3.10) for running the example, for instance as sketched below.
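
   A minimal setup sketch for steps 1 and 2 (the environment name `allegro` is an arbitrary choice, and `requirements.txt` is assumed to sit at the repo root):

   ```bash
   # Fetch the Allegro code
   git clone https://github.com/rhymes-ai/Allegro
   cd Allegro

   # Create and activate a fresh conda environment
   conda create -n allegro python=3.10 -y
   conda activate allegro

   # Install the pinned dependencies
   pip install -r requirements.txt
   ```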
+
+ 3. **Download the Allegro-TI2V model weights.**
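
   One way to fetch the weights, assuming the `huggingface_hub` CLI is installed (the target directory is arbitrary; the subfolder names match the paths passed to the inference script below):

   ```bash
   # Download the Allegro-TI2V weight folders (vae/, transformer/, text_encoder/, tokenizer/)
   huggingface-cli download rhymes-ai/Allegro-TI2V --local-dir ./Allegro-TI2V
   ```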
+
+ 4. **Run inference.**
+    ```bash
+    python single_inference_ti2v.py \
+        --user_prompt 'The car drives along the road.' \
+        --first_frame your/path/to/first_frame_image.png \
+        --vae your/path/to/vae \
+        --dit your/path/to/transformer \
+        --text_encoder your/path/to/text_encoder \
+        --tokenizer your/path/to/tokenizer \
+        --guidance_scale 8 \
+        --num_sampling_steps 100 \
+        --seed 1427329220
+    ```
+
+    The output video resolution is fixed at **720 × 1280**. Input images with different resolutions will be automatically cropped and resized to fit.
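
   If you prefer to control the crop yourself, a pre-processing sketch with plain ffmpeg (ffmpeg is assumed to be installed; 1280x720 width-by-height is an assumption about the target layout, so swap the two dimensions if your input should be portrait):

   ```bash
   # Upscale so both dimensions cover the target, then center-crop to exactly 1280x720
   ffmpeg -i input.png -vf "scale=1280:720:force_original_aspect_ratio=increase,crop=1280:720" first_frame_image.png
   ```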
+
+ ### Arguments and Descriptions
+
+ | Argument | Description |
+ |------------------------|-------------|
+ | `--user_prompt` | [Required] Text input for image-to-video generation. |
+ | `--first_frame` | [Required] First-frame image input for image-to-video generation. |
+ | `--last_frame` | [Optional] If provided, the model generates intermediate video content between the specified first- and last-frame images. |
+ | `--enable_cpu_offload` | [Optional] Offload the model to the CPU to reduce GPU memory cost (about 9.3 GB, versus 27.5 GB without offloading); inference time will increase significantly. |
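
For example, to generate the intermediate frames between a first and a last frame while keeping GPU memory low, the step 4 command can be extended with the two optional flags (paths are placeholders):

```bash
python single_inference_ti2v.py \
    --user_prompt 'The car drives along the road.' \
    --first_frame your/path/to/first_frame_image.png \
    --last_frame your/path/to/last_frame_image.png \
    --vae your/path/to/vae \
    --dit your/path/to/transformer \
    --text_encoder your/path/to/text_encoder \
    --tokenizer your/path/to/tokenizer \
    --guidance_scale 8 \
    --num_sampling_steps 100 \
    --seed 1427329220 \
    --enable_cpu_offload
```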
+
+ ### (Optional) Interpolate the video to 30 FPS
+
+ - It is recommended to use [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI) to interpolate the video from 15 FPS to 30 FPS.
+ - For better visual quality, you can use `imageio` to save the video.
 
 # License
 This repo is released under the Apache 2.0 License.