---
license: apache-2.0
---
# CogVideoX-Fun-V1.5-Reward-LoRAs
## Introduction
We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimize the videos generated by [CogVideoX-Fun-V1.5](https://github.com/aigc-apps/CogVideoX-Fun) for better alignment with human preferences.
We provide the following pre-trained models (i.e., LoRAs) along with [the training script](https://github.com/aigc-apps/CogVideoX-Fun/blob/main/scripts/train_reward_lora.py). You can use these LoRAs to enhance the corresponding base model as a plug-in, or train your own reward LoRA.

For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/CogVideoX-Fun).

| Name | Base Model | Reward Model | Hugging Face | Description |
|--|--|--|--|--|
| CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors | [CogVideoX-Fun-V1.5-5b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP) | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.5-5b-InP. It is trained with a batch size of 8 for 1,500 steps. |
| CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors | [CogVideoX-Fun-V1.5-5b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP) | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.5-5b-InP. It is trained with a batch size of 8 for 5,500 steps. |

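In essence, reward backpropagation treats the final denoising steps and the VAE decoder as a single differentiable graph, scores the decoded frames with a frozen preference model, and updates only the LoRA parameters by gradient ascent on that score. The following is a minimal sketch of one training step under those assumptions; `denoise_fn`, `decode_fn`, and `reward_fn` are hypothetical stand-ins, and the actual implementation lives in [the training script](https://github.com/aigc-apps/CogVideoX-Fun/blob/main/scripts/train_reward_lora.py).

```python
import torch

def reward_backprop_step(denoise_fn, decode_fn, reward_fn, lora_params, prompt, optimizer):
    """One illustrative reward-backpropagation step (sketch, not the official code).

    All callables are hypothetical stand-ins: `denoise_fn` runs the diffusion
    sampler with gradients enabled only for the last few denoising steps (to
    bound memory), `decode_fn` is the differentiable VAE decoder, and
    `reward_fn` is a frozen image preference model such as HPS v2.1.
    """
    latents = denoise_fn(prompt)               # sampling with truncated backprop
    frames = decode_fn(latents)                # (B, C, T, H, W) in pixel space
    reward = reward_fn(frames, prompt).mean()  # differentiable scalar reward
    loss = -reward                             # maximize reward via gradient ascent
    loss.backward()                            # gradients reach only the LoRA weights
    torch.nn.utils.clip_grad_norm_(lora_params, max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return reward.detach()
```
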
## Demo
### CogVideoX-Fun-V1.5-5B

<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
  <thead>
    <tr>
      <th style="text-align: center;" width="10%">Prompt</th>
      <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.5-5B</th>
      <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.5-5B <br> HPSv2.1 Reward LoRA</th>
      <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.5-5B <br> MPS Reward LoRA</th>
    </tr>
  </thead>
  <tr>
    <td>
      A panda eats bamboo while a monkey swings from branch to branch
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/ec752b06-cb13-4f9d-9c47-260536deba49" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/537a923c-fb64-474d-bbfb-c8ddf502a212" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/6bb3b860-57d3-4ac3-8898-b72b40753f2f" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      A penguin waddles on the ice, a camel treks by
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/ad551233-5acf-4974-91cc-cd18591acbf4" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/2763fe09-436b-4407-9e6d-385518e1720c" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/19b93c29-5e7b-414f-914d-ae010f8faf29" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      Elderly artist with a white beard painting on a white canvas
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/3560f91f-c68f-4567-a880-e3297464fb89" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/abbf827c-41e3-4e8b-9771-2f3b788985ca" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/328c85ce-1d22-428d-bf6d-1152d0457563" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      Crystal cake shimmering beside a metal apple
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/a94c74d3-8b75-41c3-9b21-0d53f9c67781" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/c9509e81-8bf7-4023-b8dd-1a3f7e5def3a" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/37157443-0cc7-4371-9f24-ec228124c206" width="100%" controls autoplay loop></video>
    </td>
  </tr>
</table>

> [!NOTE]
> The above test prompts are from <a href="https://github.com/KaiyueSun98/T2V-CompBench">T2V-CompBench</a>. All videos are generated with a LoRA weight of 0.7.

## Quick Start
We provide simple inference code below to run CogVideoX-Fun-V1.5-5b-InP with its HPS v2.1 reward LoRA.

```python
import torch
from diffusers import CogVideoXDDIMScheduler

from cogvideox.models.transformer3d import CogVideoXTransformer3DModel
from cogvideox.pipeline.pipeline_cogvideox_inpaint import CogVideoX_Fun_Pipeline_Inpaint
from cogvideox.utils.lora_utils import merge_lora
from cogvideox.utils.utils import get_image_to_video_latent, save_videos_grid

model_path = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"
lora_path = "alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors"
lora_weight = 0.7

prompt = "Pig with wings flying above a diamond mountain"
sample_size = [512, 512]
video_length = 85

# Load the base transformer and scheduler, then build the inpaint pipeline.
transformer = CogVideoXTransformer3DModel.from_pretrained_2d(model_path, subfolder="transformer").to(torch.bfloat16)
scheduler = CogVideoXDDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
pipeline = CogVideoX_Fun_Pipeline_Inpaint.from_pretrained(
    model_path, transformer=transformer, scheduler=scheduler, torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

# Merge the reward LoRA into the pipeline weights at the given strength.
pipeline = merge_lora(pipeline, lora_path, lora_weight)

generator = torch.Generator(device="cuda").manual_seed(42)
# For pure text-to-video, pass empty image/video inputs to build the latents and mask.
input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
sample = pipeline(
    prompt,
    num_frames=video_length,
    negative_prompt="bad detailed",
    height=sample_size[0],
    width=sample_size[1],
    generator=generator,
    guidance_scale=7.0,
    num_inference_steps=50,
    video=input_video,
    mask_video=input_video_mask,
).videos

save_videos_grid(sample, "samples/output.mp4", fps=8)
```
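
To try the MPS reward LoRA instead, point `lora_path` at `CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors` from the table above. To switch LoRAs within the same session, remove the merged weights first; the snippet below assumes `cogvideox.utils.lora_utils` also exports an `unmerge_lora` counterpart to `merge_lora`.

```python
from cogvideox.utils.lora_utils import unmerge_lora  # assumed counterpart to merge_lora

# Undo the HPS v2.1 merge, then merge the MPS reward LoRA at the same strength.
pipeline = unmerge_lora(pipeline, lora_path, lora_weight)
lora_path = "alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors"
pipeline = merge_lora(pipeline, lora_path, lora_weight)
```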

## Limitations
1. We observe that beyond a certain point in training, the reward keeps increasing but the quality of the generated videos does not improve further: the model learns shortcuts (e.g., adding artifacts in the background) that inflate the reward, a failure mode known as reward hacking.
2. There is still a lack of suitable preference models for video generation. Image preference models cannot evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models reduces the dynamism of the generated videos. Although this can be mitigated by computing the reward using only the first frame of the decoded video (as sketched below), the impact still persists.
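
The first-frame mitigation amounts to slicing the decoded video before scoring it. A minimal sketch, assuming a hypothetical frozen image preference model `reward_model` and a decoded, differentiable `video` tensor:

```python
import torch

def first_frame_reward(video: torch.Tensor, reward_model, prompt: str) -> torch.Tensor:
    # video: (batch, channels, frames, height, width). Scoring only frame 0
    # keeps the image preference model from penalizing motion in later frames.
    first_frame = video[:, :, 0]
    return reward_model(first_frame, prompt).mean()
```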

## References
<ol>
  <li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR, 2024.</li>
  <li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739, 2023.</li>
</ol>