# Evaluate the quality of generated videos
You can easily calculate the following video quality metrics, all of which support batch-wise processing.
- **CLIP-SCORE**: uses the pretrained CLIP model to measure the cosine similarity between the image and text modalities (see the sketch after this list).
- **FVD**: Fréchet Video Distance
- **SSIM**: structural similarity index measure
- **LPIPS**: learned perceptual image patch similarity
- **PSNR**: peak signal-to-noise ratio
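
As a reference for what CLIP-SCORE measures, here is a minimal sketch (not the repo's batch implementation) that scores a single frame against a single caption using the openai CLIP package installed below. The file name and caption are placeholders:
```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: one video frame and its caption
image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a cat playing with a ball"]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)

# CLIP score = cosine similarity of the L2-normalized embeddings
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
print((image_feat * text_feat).sum(dim=-1).item())
```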
# Requirement
## Environment
- install PyTorch (torch>=1.7.1)
- install CLIP
```
pip install git+https://github.com/openai/CLIP.git
```
- install clip-score from PyPI
```
pip install clip-score
```
- Other packages
```
pip install lpips
pip install scipy==1.7.3  # or 1.9.3; with scipy 1.11.3 you will calculate a WRONG FVD VALUE!!!
pip install numpy
pip install pillow
pip install "torchvision>=0.8.2"
pip install ftfy
pip install regex
pip install tqdm
```
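
To verify the setup, a minimal sketch that runs LPIPS on two random images (the `lpips` package expects inputs scaled to [-1, 1]; the image tensors here are placeholders):
```python
import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')  # 'alex' or 'vgg' backbones are available

# Two random RGB images in [-1, 1], the range lpips expects
img0 = torch.rand(1, 3, 64, 64) * 2 - 1
img1 = torch.rand(1, 3, 64, 64) * 2 - 1

with torch.no_grad():
    print(loss_fn(img0, img1).item())
```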
## Pretrained models
- FVD
Before you calculate FVD, you should first download the FVD pretrained model. You can manually download either of the following and put it into the FVD folder.
- `i3d_torchscript.pt` from [here](https://www.dropbox.com/s/ge9e5ujwgetktms/i3d_torchscript.pt)
- `i3d_pretrained_400.pt` from [here](https://onedrive.live.com/download?cid=78EEF3EB6AE7DBCB&resid=78EEF3EB6AE7DBCB%21199&authkey=AApKdFHPXzWLNyI)
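
For reference, FVD is the Fréchet distance between I3D feature distributions of real and generated videos. Below is a minimal sketch of that final step, assuming feature arrays of shape `(num_videos, feature_dim)` have already been extracted with the downloaded I3D model; the file path and variable names are placeholders, not the repo's exact code:
```python
import numpy as np
import torch
from scipy import linalg

# Load the downloaded TorchScript I3D model (placeholder path; put the
# checkpoint in the FVD folder first)
i3d = torch.jit.load("i3d_torchscript.pt").eval()

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to the two feature sets."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root via scipy; likely the step the scipy version pin
    # in Requirements is guarding against
    covmean = linalg.sqrtm(cov_r @ cov_f).real
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2 * covmean))
```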
## Other Notices
1. Make sure the pixel values of videos are in [0, 1] (see the preprocessing sketch after this list).
2. We average SSIM over the channels when images have 3 channels; SSIM is the only metric that is extremely sensitive to grayscale images being compared against black-and-white ones.
3. Because the I3D model downsamples in the time dimension, `frames_num` should be greater than 10 when calculating FVD, so the FVD calculation begins from the 10th frame, as in the example above.
4. For grayscale videos, we replicate the single channel to 3 channels.
5. Data input specifications for clip_score:
> - Image Files: All images should be stored in a single directory. The image files can be in either .png or .jpg format.
>
> - Text Files: All text data should be contained in plain text files in a separate directory. These text files should have the extension .txt.
>
> Note: The number of files in the image directory should be exactly equal to the number of files in the text directory. Additionally, the files in the image directory and text directory should be paired by file name. For instance, if there is a cat.png in the image directory, there should be a corresponding cat.txt in the text directory.
>
> Directory Structure Example:
> ```
> ├── path/to/image
> │   ├── cat.png
> │   ├── dog.png
> │   └── bird.jpg
> └── path/to/text
>     ├── cat.txt
>     ├── dog.txt
>     └── bird.txt
> ```
6. Data input specifications for fvd, psnr, ssim, lpips:
> Directory Structure Example:
> ```
> ├── path/to/generated_image
> │   ├── cat.mp4
> │   ├── dog.mp4
> │   └── bird.mp4
> └── path/to/real_image
>     ├── cat.mp4
>     ├── dog.mp4
>     └── bird.mp4
> ```
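
A minimal sketch of the preprocessing implied by notices 1 and 4; the function name is illustrative, not the repo's API:
```python
import torch

def prepare_video(video: torch.Tensor) -> torch.Tensor:
    """video: (T, C, H, W), uint8 in [0, 255] or float in [0, 1]."""
    video = video.float()
    if video.max() > 1.0:
        video = video / 255.0             # notice 1: pixel values must lie in [0, 1]
    if video.shape[1] == 1:
        video = video.repeat(1, 3, 1, 1)  # notice 4: replicate grayscale to 3 channels
    return video
```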
# Usage
```
# Change the file paths and set frame_num, resolution, etc. as needed.
# clip_score (cross-modality)
cd opensora/eval
bash script/cal_clip_score.sh
# fvd
cd opensora/eval
bash script/cal_fvd.sh
# psnr
cd opensora/eval
bash script/cal_psnr.sh
# ssim
cd opensora/eval
bash script/cal_ssim.sh
# lpips
cd opensora/eval
bash script/cal_lpips.sh
```
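
The scripts above wrap the metric implementations. As a reference for the quantity `cal_psnr.sh` reports per frame pair, PSNR on videos scaled to [0, 1] reduces to the following (a sketch, not the repo's exact code):
```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # With pixel values in [0, 1], the peak signal MAX_I is 1,
    # so PSNR = 10 * log10(MAX_I^2 / MSE) = -10 * log10(MSE)
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(1.0 / mse)
```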
# Acknowledgement
The evaluation codebase builds on [clip-score](https://github.com/Taited/clip-score) and [common_metrics](https://github.com/JunyaoHu/common_metrics_on_video_quality).