Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,61 @@
---
language: en
license: mit
tags:
- vision
- video-classification
model-index:
- name: nielsr/xclip-base-patch16-kinetics-600-16-frames
  results:
  - task:
      type: video-classification
    dataset:
      name: Kinetics 600
      type: kinetics-600
    metrics:
    - type: top-1 accuracy
      value: 85.8
    - type: top-5 accuracy
      value: 97.3
---

# X-CLIP (base-sized model)

X-CLIP model (base-sized, patch resolution of 16) trained in a fully supervised fashion on [Kinetics-600](https://www.deepmind.com/open-source/kinetics). It was introduced in the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Ni et al. and first released in [this repository](https://github.com/microsoft/VideoX/tree/master/X-CLIP).

This model was trained using 16 frames per video, at a resolution of 224x224.
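
Since the model consumes 16 frames per video, a longer clip has to be reduced to 16 frames before inference. This card does not say how the frames are selected, so the following is only a minimal sketch that assumes uniform sampling over the clip; the helper name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames, num_frames=16):
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames.

    Assumption: uniform sampling. The original training code may sample differently.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Split the clip into num_frames equal segments and take each segment's midpoint.
    segment = total_frames / num_frames
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


# Example: choose 16 frames from a 160-frame clip.
indices = sample_frame_indices(160)
```

When a clip has fewer than 16 frames, this sketch repeats indices rather than failing, which keeps the input length fixed.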

Disclaimer: The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

X-CLIP is a minimal extension of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for general video-language understanding. The model is trained contrastively on (video, text) pairs.

![X-CLIP architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png)

This allows the model to be used for tasks like zero-shot, few-shot, or fully supervised video classification and video-text retrieval.
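
To illustrate how a contrastively trained model can classify a video, the sketch below scores one video embedding against several text embeddings with cosine similarity and turns the scores into probabilities with a softmax. The embeddings are made-up toy vectors, and the fixed `logit_scale` of 100.0 is an assumption standing in for the learned temperature; none of this is taken from the X-CLIP code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probabilities(video_embedding, text_embeddings, logit_scale=100.0):
    """Probability of each text given the video (logit_scale is an assumed constant)."""
    logits = [logit_scale * cosine_similarity(video_embedding, t) for t in text_embeddings]
    return softmax(logits)


# Toy embeddings, purely for illustration.
video = [0.9, 0.1, 0.0]
texts = {
    "playing guitar": [1.0, 0.0, 0.0],
    "cooking": [0.0, 1.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}
probs = zero_shot_probabilities(video, list(texts.values()))
best_label = max(zip(texts, probs), key=lambda pair: pair[1])[0]
```

Video-text retrieval follows the same pattern in the other direction: one text embedding is scored against many video embeddings.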

## Intended uses & limitations

You can use the raw model to determine how well a text matches a given video. See the [model hub](https://huggingface.co/models?search=microsoft/xclip) to look for fine-tuned versions on a task that interests you.

### How to use

For code examples, see the [documentation](https://huggingface.co/docs/transformers/main/model_doc/xclip).
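
As a rough starting point, the sketch below wraps the `XCLIPProcessor` and `XCLIPModel` classes from the Transformers library in a small helper. The helper name `classify_video` is hypothetical, and the default checkpoint name is an assumption based on the hub search link above; replace it with the repository you actually use. The function expects 16 RGB frames, each a `(height, width, 3)` array.

```python
# Assumed Hub checkpoint name; adjust it to the repository you actually use.
CHECKPOINT = "microsoft/xclip-base-patch16-kinetics-600-16-frames"


def classify_video(frames, labels, checkpoint=CHECKPOINT):
    """Score candidate text labels against one video.

    frames: list of 16 RGB frames, each a (height, width, 3) uint8 NumPy array.
    labels: list of candidate label strings.
    Returns a dict mapping each label to its probability.
    """
    # Imported lazily so that defining the helper does not need the heavy dependencies.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(checkpoint)
    model = XCLIPModel.from_pretrained(checkpoint)
    model.eval()

    inputs = processor(text=labels, videos=list(frames), return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_video has shape (num_videos, num_labels); softmax gives probabilities.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}
```

A call such as `classify_video(frames, ["playing sports", "eating spaghetti", "go shopping"])` downloads the checkpoint on first use and returns one probability per label.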

## Training data

This model was trained on [Kinetics-600](https://www.deepmind.com/open-source/kinetics).

### Preprocessing

The exact details of preprocessing during training can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L247).

The exact details of preprocessing during validation can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L285).

During validation, the shorter edge of each frame is resized, after which the frame is center-cropped to a fixed resolution (such as 224x224). Frames are then normalized across the RGB channels with the ImageNet mean and standard deviation.
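
The validation steps described above can be sketched with plain arithmetic. This is not the original pipeline: the helper names are hypothetical, the shorter-edge target of 224 is an assumption (some pipelines resize to a larger size before cropping), and the mean and standard deviation are the commonly used ImageNet values for inputs scaled to the range 0 to 1.

```python
# Commonly used ImageNet statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_size(height, width, shorter_edge=224):
    """Scale (height, width) so the shorter edge equals `shorter_edge`, keeping aspect ratio."""
    scale = shorter_edge / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_box(height, width, crop=224):
    """Return (top, left, bottom, right) of a centered crop-by-crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to per-channel standardized floats."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# Example: a 360x640 frame is resized, then cropped to 224x224.
new_h, new_w = resized_size(360, 640)
box = center_crop_box(new_h, new_w)
white = normalize_pixel((255, 255, 255))
```

In practice the processor that ships with the checkpoint applies these steps to whole frames, so a manual version like this is mainly useful for checking shapes and value ranges.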

## Evaluation results

This model achieves a top-1 accuracy of 85.8% and a top-5 accuracy of 97.3%.
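
For readers unfamiliar with the metrics, top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below is a generic illustration with made-up scores, not the evaluation script behind the numbers above; it uses k=2 only because the toy example has three classes.

```python
def top_k_accuracy(scores, targets, k=1):
    """Fraction of samples whose true class index is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy scores for four videos over three classes, with their true class indices.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
targets = [1, 1, 2, 2]
top1 = top_k_accuracy(scores, targets, k=1)  # 0.5
top2 = top_k_accuracy(scores, targets, k=2)  # 0.75
```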