# CogVideoX-Fun-V1.5-Reward-LoRAs
## Introduction
We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimize the videos generated by [CogVideoX-Fun-V1.5](https://github.com/aigc-apps/CogVideoX-Fun) for better alignment with human preferences.
We provide the following pre-trained models (i.e., LoRAs) along with [the training script](https://github.com/aigc-apps/CogVideoX-Fun/blob/main/scripts/train_reward_lora.py). You can use these LoRAs as plug-ins to enhance the corresponding base model, or train your own reward LoRA.
For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/CogVideoX-Fun).
| Name | Base Model | Reward Model | Hugging Face | Description |
|--|--|--|--|--|
| CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors | [CogVideoX-Fun-V1.5-5b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP) | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.5-5b-InP. It is trained with a batch size of 8 for 1,500 steps.|
| CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors | [CogVideoX-Fun-V1.5-5b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP) | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.5-5b-InP. It is trained with a batch size of 8 for 5,500 steps.|
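Both LoRA files can also be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the repo id and filename are taken from the table above:

```python
from huggingface_hub import hf_hub_download

# Download the HPS v2.1 reward LoRA into the local Hugging Face cache
# and return the path of the downloaded file.
lora_path = hf_hub_download(
    repo_id="alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs",
    filename="CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors",
)
print(lora_path)
```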
## Demo
### CogVideoX-Fun-V1.5-5B
<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
<thead>
<tr>
<th style="text-align: center;" width="10%">Prompt</th>
<th style="text-align: center;" width="30%">CogVideoX-Fun-V1.5-5B</th>
<th style="text-align: center;" width="30%">CogVideoX-Fun-V1.5-5B <br> HPSv2.1 Reward LoRA</th>
<th style="text-align: center;" width="30%">CogVideoX-Fun-V1.5-5B <br> MPS Reward LoRA</th>
</tr>
</thead>
<tr>
<td>
A panda eats bamboo while a monkey swings from branch to branch
</td>
<td>
<video src="https://github.com/user-attachments/assets/ec752b06-cb13-4f9d-9c47-260536deba49" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/537a923c-fb64-474d-bbfb-c8ddf502a212" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/6bb3b860-57d3-4ac3-8898-b72b40753f2f" width="100%" controls autoplay loop></video>
</td>
</tr>
<tr>
<td>
A penguin waddles on the ice, a camel treks by
</td>
<td>
<video src="https://github.com/user-attachments/assets/ad551233-5acf-4974-91cc-cd18591acbf4" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/2763fe09-436b-4407-9e6d-385518e1720c" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/19b93c29-5e7b-414f-914d-ae010f8faf29" width="100%" controls autoplay loop></video>
</td>
</tr>
<tr>
<td>
Elderly artist with a white beard painting on a white canvas
</td>
<td>
<video src="https://github.com/user-attachments/assets/3560f91f-c68f-4567-a880-e3297464fb89" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/abbf827c-41e3-4e8b-9771-2f3b788985ca" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/328c85ce-1d22-428d-bf6d-1152d0457563" width="100%" controls autoplay loop></video>
</td>
</tr>
<tr>
<td>
Crystal cake shimmering beside a metal apple
</td>
<td>
<video src="https://github.com/user-attachments/assets/a94c74d3-8b75-41c3-9b21-0d53f9c67781" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/c9509e81-8bf7-4023-b8dd-1a3f7e5def3a" width="100%" controls autoplay loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/37157443-0cc7-4371-9f24-ec228124c206" width="100%" controls autoplay loop></video>
</td>
</tr>
</table>
> [!NOTE]
> The test prompts above are from <a href="https://github.com/KaiyueSun98/T2V-CompBench">T2V-CompBench</a>. All videos are generated with a LoRA weight of 0.7.
## Quick Start
We provide simple inference code to run CogVideoX-Fun-V1.5-5b-InP with its HPS v2.1 reward LoRA.
```python
import torch
from diffusers import CogVideoXDDIMScheduler

from cogvideox.models.transformer3d import CogVideoXTransformer3DModel
from cogvideox.pipeline.pipeline_cogvideox_inpaint import CogVideoX_Fun_Pipeline_Inpaint
from cogvideox.utils.lora_utils import merge_lora
from cogvideox.utils.utils import get_image_to_video_latent, save_videos_grid

model_path = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"
# Path to the reward LoRA weights (download the file first,
# e.g. with `hf_hub_download` as shown above).
lora_path = "alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors"
lora_weight = 0.7

prompt = "Pig with wings flying above a diamond mountain"
sample_size = [512, 512]
video_length = 85

# Load the transformer and scheduler, then build the inpainting pipeline.
transformer = CogVideoXTransformer3DModel.from_pretrained_2d(model_path, subfolder="transformer").to(torch.bfloat16)
scheduler = CogVideoXDDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
pipeline = CogVideoX_Fun_Pipeline_Inpaint.from_pretrained(
    model_path, transformer=transformer, scheduler=scheduler, torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

# Merge the reward LoRA into the pipeline with the given strength.
pipeline = merge_lora(pipeline, lora_path, lora_weight)

generator = torch.Generator(device="cuda").manual_seed(42)
# Build empty video/mask latents for pure text-to-video generation.
input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
sample = pipeline(
    prompt,
    num_frames=video_length,
    negative_prompt="bad detailed",
    height=sample_size[0],
    width=sample_size[1],
    generator=generator,
    guidance_scale=7.0,
    num_inference_steps=50,
    video=input_video,
    mask_video=input_video_mask,
).videos
save_videos_grid(sample, "samples/output.mp4", fps=8)
```
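To try the MPS reward LoRA instead, point `lora_path` at `CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors`. The `lora_weight` argument controls how strongly the merged LoRA influences generation; the demo videos above were generated with a weight of 0.7.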
## Limitations
1. We observe that beyond a certain point in training, the reward keeps increasing while the quality of the generated videos no longer improves. The model learns shortcuts (e.g., adding artifacts in the background) to inflate the reward, i.e., reward hacking.
2. There is still a lack of preference models suited to video generation. Image preference models cannot evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models reduces the dynamism of the generated videos. This can be mitigated by computing the reward using only the first frame of the decoded video (see the sketch below), but the effect still persists.
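For intuition, here is a minimal sketch of one reward-backpropagation step using the first-frame workaround mentioned above. `decode_first_frame` and `image_reward_model` are hypothetical stand-ins for the VAE decoding step and a differentiable image preference model (e.g., HPS v2.1 or MPS); the actual training logic lives in [the training script](https://github.com/aigc-apps/CogVideoX-Fun/blob/main/scripts/train_reward_lora.py).

```python
import torch

def reward_backprop_step(latents, prompts, decode_first_frame, image_reward_model, optimizer):
    """One hypothetical reward-backpropagation step on the first decoded frame.

    `decode_first_frame` and `image_reward_model` are illustrative stand-ins,
    not part of the CogVideoX-Fun API.
    """
    first_frame = decode_first_frame(latents)          # [B, C, H, W], differentiable
    reward = image_reward_model(first_frame, prompts)  # per-sample scalar rewards
    loss = -reward.mean()                              # maximizing reward = minimizing its negative
    loss.backward()                                    # gradients flow back into the LoRA parameters
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```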
## Reference
<ol>
<li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards.". In ICLR 2024.</li>
<li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).</li>
</ol>