---
license: cc-by-nc-4.0
---
# Localized Audio Visual DeepFake Dataset (LAV-DF)
This repo is the official PyTorch implementation of the DICTA paper [Do You Really Mean That? Content Driven Audio-Visual
Deepfake Dataset and Multimodal Method for Temporal Forgery Localization](https://ieeexplore.ieee.org/document/10034605)
(Best Award), and the journal paper [_Glitch in the Matrix_: A Large Scale Benchmark for Content Driven Audio-Visual
Forgery Detection and Localization](https://www.sciencedirect.com/science/article/pii/S1077314223001984), accepted by CVIU.
## LAV-DF Dataset
### Download
To use the LAV-DF dataset, you must agree to the [terms and conditions](https://github.com/ControlNet/LAV-DF/blob/master/TERMS_AND_CONDITIONS.md).
Download link: [OneDrive](https://monashuni-my.sharepoint.com/:f:/g/personal/zhixi_cai_monash_edu/EklD-8lD_GRNl0yyJJ-cF3kBWEiHRmH4U5Dtg7eJjAOUlg?e=wowDpd), [Google Drive](https://drive.google.com/drive/folders/1U8asIMb0bpH6-zMR_5FaJmPnC53lomq7?usp=sharing), [HuggingFace](https://huggingface.co/datasets/ControlNet/LAV-DF).
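For the HuggingFace mirror, the dataset can also be fetched programmatically. Below is a minimal sketch using `huggingface_hub.snapshot_download`; the `local_dir` name is illustrative, and the archive layout may differ between mirrors.
```python
# A minimal sketch for fetching the HuggingFace mirror of the dataset.
# Requires `pip install huggingface_hub`; the local_dir name is illustrative.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ControlNet/LAV-DF",  # the dataset repo linked above
    repo_type="dataset",
    local_dir="LAV-DF",           # hypothetical download location
)
```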
### Baseline Benchmark
| Method | AP@0.5 | AP@0.75 | AP@0.95 | AR@100 | AR@50 | AR@20 | AR@10 |
|---------|--------|---------|---------|--------|-------|-------|-------|
| BA-TFD  | 79.15  | 38.57   | 0.24    | 67.03  | 64.18 | 60.89 | 58.51 |
| BA-TFD+ | 96.30  | 84.96   | 4.44    | 81.62  | 80.48 | 79.40 | 78.75 |
Please note that the BA-TFD result is slightly better than the one reported in the paper, because this repository uses better hyperparameters.
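For reference, the AP@IoU metrics above score a predicted forged segment against the ground truth by temporal IoU. The sketch below illustrates the thresholding; it is not the repository's actual evaluation code.
```python
# Illustrative temporal IoU between two 1-D segments (start, end) in seconds;
# a sketch, not the repository's evaluation code.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction is a true positive for AP@0.5 only if its IoU with a ground-truth
# forged segment is at least 0.5; AP@0.95 demands near-exact boundaries, which
# explains the sharp drop at that threshold in the table above.
print(temporal_iou((1.0, 2.0), (1.2, 2.1)))  # ~0.73: hit at 0.5, miss at 0.75
```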
## Baseline Models
### Requirements
The main dependency versions are:
- Python >= 3.7, < 3.11
- PyTorch >= 1.13
- torchvision >= 0.14
- pytorch_lightning == 1.7.*
Run the following command to install the required packages.
```bash
pip install -r requirements.txt
```
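After installation, a quick sanity check of the environment can save debugging time later; this is a sketch, not part of the repository.
```python
# Check the installed versions against the requirements above;
# a sketch, not part of the repository.
import sys
import torch
import torchvision
import pytorch_lightning

assert (3, 7) <= sys.version_info[:2] < (3, 11), "Python must be >=3.7,<3.11"
print("torch:", torch.__version__)                          # expect >= 1.13
print("torchvision:", torchvision.__version__)              # expect >= 0.14
print("pytorch_lightning:", pytorch_lightning.__version__)  # expect 1.7.*
print("CUDA available:", torch.cuda.is_available())
```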
### Training BA-TFD
Train the BA-TFD model introduced in the paper [Do You Really Mean That? Content Driven Audio-Visual
Deepfake Dataset and Multimodal Method for Temporal Forgery Localization](https://ieeexplore.ieee.org/document/10034605) with the default hyperparameters on the LAV-DF dataset.
```bash
python train.py \
--config ./config/batfd_default.toml \
--data_root <DATASET_PATH> \
--batch_size 4 --num_workers 8 --gpus 1 --precision 16
```
The checkpoints will be saved in the `ckpt` directory, and the TensorBoard logs will be saved in the `lightning_logs` directory.
### Training BA-TFD+
Train the BA-TFD+ model introduced in the paper [_Glitch in the Matrix_: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization](https://www.sciencedirect.com/science/article/pii/S1077314223001984) with the default hyperparameters on the LAV-DF dataset.
```bash
python train.py \
--config ./config/batfd_plus_default.toml \
--data_root <DATASET_PATH> \
--batch_size 4 --num_workers 8 --gpus 2 --precision 32
```
Please use `FP32` precision for training BA-TFD+, as `FP16` produces `inf` and `NaN` values (illustrated below).
The checkpoints will be saved in the `ckpt` directory, and the TensorBoard logs will be saved in the `lightning_logs` directory.
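The FP16 failure mode mentioned above is easy to reproduce: float16 saturates at 65504, so large intermediate values overflow to `inf`, and subsequent arithmetic yields `NaN`. A minimal illustration:
```python
# Why FP16 training can blow up: float16 overflows beyond 65504.
import torch

x = torch.tensor([70000.0], dtype=torch.float16)
print(x)      # tensor([inf], dtype=torch.float16)
print(x - x)  # tensor([nan], dtype=torch.float16)
```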
### Evaluation
Please run the following command to evaluate a model with a checkpoint saved in the `ckpt` directory.
Alternatively, you can download the pretrained [BA-TFD](https://github.com/ControlNet/LAV-DF/releases/download/pretrained_model/batfd_default.ckpt) and [BA-TFD+](https://github.com/ControlNet/LAV-DF/releases/download/pretrained_model_v2/batfd_plus_default.ckpt) models.
```bash
python evaluate.py \
--config <CONFIG_PATH> \
--data_root <DATASET_PATH> \
--checkpoint <CHECKPOINT_PATH> \
--batch_size 1 --num_workers 4
```
The script will generate temporal localization results in the `output` directory and print the AP and AR scores to
the console.
Please make sure only one GPU is visible to the evaluation script (e.g., by setting `CUDA_VISIBLE_DEVICES=0`).
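To verify a downloaded pretrained model before running the full evaluation, you can inspect it as a standard PyTorch Lightning checkpoint. A minimal sketch; the keys shown are standard Lightning fields, not repository-specific.
```python
# Inspect a downloaded Lightning checkpoint; a sketch relying only on the
# standard torch / Lightning checkpoint structure.
import torch

ckpt = torch.load("batfd_plus_default.ckpt", map_location="cpu")
print(list(ckpt.keys()))                  # typically includes 'state_dict'
print(len(ckpt["state_dict"]), "tensors in state_dict")
```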
## License
This project is under the CC BY-NC 4.0 license. See [LICENSE](LICENSE) for details.
## References
If you find this work useful in your research, please cite the following papers.
The conference paper:
```bibtex
@inproceedings{cai2022you,
title = {Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization},
author = {Cai, Zhixi and Stefanov, Kalin and Dhall, Abhinav and Hayat, Munawar},
booktitle = {2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)},
year = {2022},
doi = {10.1109/DICTA56598.2022.10034605},
pages = {1--10},
address = {Sydney, Australia},
}
```
The extended journal version, accepted by CVIU:
```bibtex
@article{cai2023glitch,
title = {Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization},
author = {Cai, Zhixi and Ghosh, Shreya and Dhall, Abhinav and Gedeon, Tom and Stefanov, Kalin and Hayat, Munawar},
journal = {Computer Vision and Image Understanding},
year = {2023},
volume = {236},
pages = {103818},
issn = {1077-3142},
doi = {10.1016/j.cviu.2023.103818},
}
```
## Acknowledgements
Some code related to the boundary matching mechanism is borrowed from
[JJBOY/BMN-Boundary-Matching-Network](https://github.com/JJBOY/BMN-Boundary-Matching-Network) and
[xxcheng0708/BSNPlusPlus-boundary-sensitive-network](https://github.com/xxcheng0708/BSNPlusPlus-boundary-sensitive-network).