Spaces: Running on Zero
Alex Ergasti committed · Commit 7d1261a · Parent(s): 67d1f09
Update readme
README.md
CHANGED
@@ -1,149 +1,7 @@
-
-
-
-
-
-
-
-
-
-In our paper we explore three different model configurations, illustrated here:
-<center><img src="imgs/blocks.png" alt="drawing" width="800"/></center>
-
-## Results
-
-### Examples of short video generated on AIST and Landscape
-
-https://github.com/user-attachments/assets/d7d544d0-7c62-4870-b783-4f0efa8eebee
-
-https://github.com/user-attachments/assets/aa6e0dfa-cbee-4127-b4e6-c96386cc0870
-
-https://github.com/user-attachments/assets/027f8a5a-ba7f-404b-863b-f3fabbcad9a6
-
-https://github.com/user-attachments/assets/0cbbdd84-393d-4d7b-af82-537a4398d2d1
-
-
-### Examples of long video generated on AIST
-
-https://github.com/user-attachments/assets/233661cd-1cc0-4759-83be-faff0c988151
-
-https://github.com/user-attachments/assets/5223acf3-04bc-4d34-924d-c7483e07f1e2
-
-## Setup
-
-Create the conda env:
-```bash
-conda create -y -n FLAV python=3.12
-conda activate FLAV
-conda install -y pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
-pip install pysoundfile transformers diffusers einops accelerate librosa timm
-pip install onnx onnxruntime onnxsim omegaconf
-pip install moviepy
-pip install pyav
-pip install git+https://github.com/facebookresearch/segment-anything.git
-```
-
-## Inference
-Will be published soon.
-
-Command line options should match those of the loaded model (e.g. number of classes, predicted frames, etc.) to avoid loading errors:
-```bash
-python sample-metrics.py \
-    --model FLAV-B/1 \
-    --data-path <datapath> \
-    --batch-size 32 --num-classes <classes> \
-    --image-size 256 \
-    --experiment-dir <exp-dir> \
-    --results-dir results \
-    --video-length 16 \
-    --predict-frames 10 \
-    --causal-attn \
-    --num-videos 2048 \
-    --audio-scale <audio-scale> \
-    --num-workers 16 \
-    --num_timesteps 20 \
-    --use_sd_vae \
-    --ignore-cache --vocoder-ckpt <vocoder-ckpt>
-```
-
-Where `<exp-dir>` is:
-```
-└──checkpoint
-   └──ema.pth
-```
-
-Where `<vocoder-ckpt>` is:
-```
-└──vocoder
-   ├──config.json
-   └──vocoder.pt
-```
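The two checkpoint layouts above can be sanity-checked before launching a run. This is a minimal sketch, not part of the repo; the `missing_files` helper and the relative paths (which assume `<exp-dir>` and `<vocoder-ckpt>` point at the parent of `checkpoint/` and `vocoder/` respectively) are illustrative assumptions:

```python
from pathlib import Path

def missing_files(root: str, required: list[str]) -> list[str]:
    """Return the entries from `required` that are absent under `root`."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).is_file()]

# Expected contents, mirroring the trees above (directory names assumed).
EXP_DIR_FILES = ["checkpoint/ema.pth"]
VOCODER_FILES = ["vocoder/config.json", "vocoder/vocoder.pt"]
```

Calling `missing_files("<exp-dir>", EXP_DIR_FILES)` before `sample-metrics.py` turns a cryptic loading error into an explicit list of what is missing.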
-
-## Training
-```bash
-accelerate launch --multi_gpu --num_processes=... \
-    train.py \
-    --model FLAV-B/1 \
-    --data-path <datapath> \
-    --image-size 256 \
-    --batch-size 16 --num-classes <classes> \
-    --experiment-dir <experiment-dir> \
-    --results-dir results/ \
-    --sample-every 20000 \
-    --ckpt-every 5000 \
-    --log-every 100 \
-    --video-length 50 \
-    --predict-frames 10 \
-    --sampling logit \
-    --num-workers 16 \
-    --grad-ckpt \
-    --causal-attn \
-    --use_sd_vae \
-    --audio-scale <audio-scale>
-```
-
-Where `<datapath>` is the dataset folder, organised as follows.
-
-If the dataset does not have classes:
-```
-dataset-folder
-├──train
-│   ├──file1.mp4
-│   └──file2.mp4
-└──test
-    └──file3.mp4
-```
-If the dataset has classes:
-```
-dataset-folder
-├──train
-│   ├──class0
-│   │   └──file1.mp4
-│   └──class1
-│       └──file2.mp4
-└──test
-    ├──class0
-    │   └──file3.mp4
-    └──class1
-        └──file4.mp4
-```
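Both layouts can be walked with the same recursive glob, since the class subfolders only add one directory level. A minimal sketch (the `list_videos` helper is illustrative, not the repo's loader):

```python
from pathlib import Path

def list_videos(dataset_folder: str, split: str = "train") -> list[Path]:
    """Collect every .mp4 under <dataset-folder>/<split>; rglob covers
    both the flat layout and the class-subfolder layout shown above."""
    return sorted((Path(dataset_folder) / split).rglob("*.mp4"))
```

Running it against a freshly prepared dataset folder is a quick way to confirm the files were placed where the trainer expects them.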
-`<classes>` is the number of classes in the dataset.
-
-`<audio-scale>` is:
-- For AIST++: `3.5009668382765917`
-- For landscape: `3.0951129410195515`
-
-
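These per-dataset constants look like normalisation factors for the audio latents. A sketch of how such a scale is commonly applied (illustrative only — the function name and the direction of scaling are assumptions, not the repo's actual code):

```python
# README values, keyed by dataset name (keys are illustrative).
AUDIO_SCALES = {
    "aist++": 3.5009668382765917,
    "landscape": 3.0951129410195515,
}

def scale_audio_latent(values: list[float], dataset: str) -> list[float]:
    """Divide raw audio latents by the dataset constant so their magnitude
    is comparable across modalities (assumed direction of scaling)."""
    s = AUDIO_SCALES[dataset]
    return [v / s for v in values]
```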
-## Citation
-
-```
-@misc{ergasti2025rflavrollingflowmatching,
-    title={$^R$FLAV: Rolling Flow matching for infinite Audio Video generation},
-    author={Alex Ergasti and Giuseppe Gabriele Tarollo and Filippo Botti and Tomaso Fontanini and Claudio Ferrari and Massimo Bertozzi and Andrea Prati},
-    year={2025},
-    eprint={2503.08307},
-    archivePrefix={arXiv},
-    primaryClass={cs.CV},
-    url={https://arxiv.org/abs/2503.08307},
-}
-```
+---
+title: R-FLAV
+sdk: gradio
+sdk_version: "5.20.1"
+app_file: app.py
+pinned: false
+---