mhan committed c13efb5 (1 parent: ff853c5)

Update README.md

Files changed (1): README.md (+48, -1)
README.md CHANGED
@@ -7,4 +7,51 @@ language:
metrics:
- bleu
pipeline_tag: visual-question-answering
---

# Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

![image/png](https://cdn-uploads.huggingface.co/production/uploads/641ae9911911d3be67422e6f/0KwEa8cvg0KEq7wLmhpLz.png)

## Dataset Description

- **Repository:** [Shot2Story](https://github.com/bytedance/Shot2Story)
- **Paper:** [2312.10300](https://arxiv.org/abs/2312.10300)
- **Point of Contact:** [Mingfei Han](mailto:hmf282@gmail.com)

**For video data downloading, please have a look at [this issue](https://github.com/bytedance/Shot2Story/issues/5).**

We are excited to release a new video-text benchmark for multi-shot video understanding. This release contains the 134k version of our dataset, which includes detailed long summaries (human annotated + GPTV generated) for 134k videos and human-annotated shot captions for 188k video shots. Please check the dataset [here](https://huggingface.co/datasets/mhan/Shot2Story-134K).
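
As a quick illustration, the annotation files can be pulled straight from the Hub with `huggingface_hub`. This is only a minimal sketch: the `*.json` pattern and the commented file name are assumptions about the repository layout, so please check the dataset page for the actual file names.

```python
# Minimal sketch (not an official loader): fetch the Shot2Story-134K
# annotation files from the Hugging Face Hub. The "*.json" pattern and the
# commented file name below are assumptions -- check the dataset repo for
# the actual layout.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="mhan/Shot2Story-134K",
    repo_type="dataset",
    allow_patterns=["*.json"],  # assumption: annotations are shipped as JSON
)
print("Annotation files downloaded to:", local_dir)

# Hypothetical follow-up once you know the real file name:
# import json
# with open(f"{local_dir}/some_annotation_file.json") as f:
#     annotations = json.load(f)
```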

## Models

We are releasing the checkpoints trained with our [Shot2Story-20K](https://huggingface.co/datasets/mhan/Shot2Story-20K) and [Shot2Story-134K](https://huggingface.co/datasets/mhan/Shot2Story-134K) datasets.

- **{20k,134k}-version/sum_shot_best_epoch.pth:** Model tuned on our multi-shot summary data. Referenced by the `ckpt` field in the config files.
- **{20k,134k}-version/shot_av_best_epoch.pth:** Model trained on our single-shot caption data. Referenced by the `ckpt` field in the config files.
- **transnetv2-pytorch-weights.pth:** Checkpoint for the automatic shot-detection method used in the bot demo. Please follow the original license of TransNetV2.
- **BLIP.cache.tar:** Cached checkpoints for training, testing, and offline demos, provided only to ease use on servers that cannot access Hugging Face. Please abide by the original licenses of the respective models.
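
For orientation only, the sketch below shows one way to pull a checkpoint file and peek inside it with PyTorch. The repository id is a placeholder and the checkpoint structure is an assumption; the actual training and evaluation entry points live in the Shot2Story GitHub repo.

```python
# Minimal sketch of downloading and inspecting one released checkpoint.
# The repo_id is a placeholder for this model repository, and the state-dict
# structure is an assumption -- use the Shot2Story codebase for real loading.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="mhan/Shot2Story",  # placeholder: substitute this repository's actual id
    filename="134k-version/sum_shot_best_epoch.pth",
)

# Standard PyTorch checkpoint; load on CPU just to peek at its contents.
state = torch.load(ckpt_path, map_location="cpu")
keys = list(state.keys()) if isinstance(state, dict) else []
print(f"Loaded {ckpt_path}; top-level keys: {keys[:5]}")
```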

## License <a name="license"></a>

Our text annotations are licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License](https://creativecommons.org/licenses/by-nc-sa/4.0/). They are available strictly for non-commercial research.

Please note that our dataset does not include the original videos. Users must refer to [HD-VILA-100M](https://github.com/microsoft/XPretrain/blob/main/hd-vila-100m/README.md) for video access. By downloading our annotations, you agree to these terms. Respect for video copyright holders is paramount. Ensure your use of the videos aligns with the original source's terms.

---

## Citation <a name="citation"></a>

If you find our work useful for your research, please consider citing the paper:

```bibtex
@misc{han2023shot2story20k,
      title={Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos},
      author={Mingfei Han and Linjie Yang and Xiaojun Chang and Heng Wang},
      year={2023},
      eprint={2312.10300},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```