Alex Ergasti committed on
Commit 7d1261a · 1 Parent(s): 67d1f09

Update readme

Files changed (1)
  1. README.md +7 -149
README.md CHANGED
@@ -1,149 +1,7 @@
- # $^R$-FLAV: Rolling Flow matching for infinite Audio Video generation
-
- This is the official implementation of
-
- An overview of our models is shown here: $^R$-FLAV: Rolling Flow mathcing for infinite Audio Video generation.
-
- <center><img src="imgs/FLAV.png" alt="drawing" width="500"/></center>
-
-
- In our paper we explores three different model configuration, illustrated here:
- <center><img src="imgs/blocks.png" alt="drawing" width="800"/></center>
-
- ## Results
-
- ### Examples of short video generated on AIST and Landscape
-
- https://github.com/user-attachments/assets/d7d544d0-7c62-4870-b783-4f0efa8eebee
-
- https://github.com/user-attachments/assets/aa6e0dfa-cbee-4127-b4e6-c96386cc0870
-
- https://github.com/user-attachments/assets/027f8a5a-ba7f-404b-863b-f3fabbcad9a6
-
- https://github.com/user-attachments/assets/0cbbdd84-393d-4d7b-af82-537a4398d2d1
-
-
- ### Examples of long video generated on AIST
-
- https://github.com/user-attachments/assets/233661cd-1cc0-4759-83be-faff0c988151
-
- https://github.com/user-attachments/assets/5223acf3-04bc-4d34-924d-c7483e07f1e2
-
- ## Setup
-
- Create conda env:
- ```bash
- conda create -y -n FLAV python=3.12
- conda activate FLAV
- conda install -y pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
- pip install pysoundfile transformers diffusers einops accelerate librosa timm
- pip install onnx onnxruntime onnxsim omegaconf
- pip install moviepy
- pip install pyav
- pip install git+https://github.com/facebookresearch/segment-anything.git
- ```
-
- ## Inference
- Will be published soon.
-
- Command line options should be the same as the loaded model (eg. num classes, predicted frames ecc.) to avoid loading errors:
- ```bash
- python sample-metrics.py \
- --model FLAV-B/1 \
- --data-path <datapath> \
- --batch-size 32 --num-classes <classes> \
- --image-size 256 \
- --experiment-dir <exp-dir>\
- --results-dir results \
- --video-length 16 \
- --predict-frames 10 \
- --causal-attn \
- --num-videos 2048 \
- --audio-scale <audio-scale> \
- --num-workers 16 \
- --num_timesteps 20 \
- --use_sd_vae \
- --ignore-cache --vocoder-ckpt <vocoder-ckpt>
- ```
-
- Where `<exp-dir>` is:
- ```
- └──checkpoint
- └──ema.pth
- ```
-
- Where `<vocoder-ckpt>` is:
- ```
- └──vocoder
- ├──config.json
- └──vocoder.pt
- ```
-
- ## Training
- ```bash
- accelerate launch --multi_gpu --num_processes=... \
- train.py \
- --model FLAV-B/1 \
- --data-path <datapath> \
- --image-size 256 \
- --batch-size 16 --num-classes <classes> \
- --experiment-dir <experiment-dir> \
- --results-dir results/ \
- --sample-every 20000 \
- --ckpt-every 5000 \
- --log-every 100 \
- --video-length 50 \
- --predict-frames 10 \
- --sampling logit \
- --num-workers 16 \
- --grad-ckpt \
- --causal-attn \
- --use_sd_vae \
- --audio-scale <audio-scale>
- ```
-
- Where `<datapath>` is the dataset folder organised as follow:
-
- If the dataset does not have classes:
- ```
- dataset-folder:
- ├──train
- │ ├──file1.mp4
- │ ├──file2.mp4
- └──test
- └──file3.mp4
- ```
- If the dataset does have classes:
- ```
- dataset-folder:
- ├──train
- │ ├──class0
- │ │ └──file1.mp4
- │ └──class1
- │ └──file2.mp4
- └──test
- ├──class0
- │ └──file3.mp4
- └──class1
- └──file4.mp4
- ```
- `<classes>` is the number of classes in the dataset.
-
- `<audio-scale>` is:
- - For AIST++: `3.5009668382765917`
- - For landscape: `3.0951129410195515`
-
-
- ## Citation
-
- ```
- @misc{ergasti2025rflavrollingflowmatching,
- title={$^R$FLAV: Rolling Flow matching for infinite Audio Video generation},
- author={Alex Ergasti and Giuseppe Gabriele Tarollo and Filippo Botti and Tomaso Fontanini and Claudio Ferrari and Massimo Bertozzi and Andrea Prati},
- year={2025},
- eprint={2503.08307},
- archivePrefix={arXiv},
- primaryClass={cs.CV},
- url={https://arxiv.org/abs/2503.08307},
- }
- ```
 
+ ---
+ title: R-FLAV
+ sdk: gradio
+ sdk_version: "5.20.1"
+ app_file: app.py
+ pinned: false
+ ---