eduardo-alvarez committed on
Commit
9351e9f
1 Parent(s): ead7ef1

Enriching model card for improved discoverability and consumption (#1)

- Enriching model card for improved discoverability and consumption (51af3ebcce9efc05f8e775001ccfe874f6157a1e)

Files changed (1)
  1. README.md +51 -23
README.md CHANGED
@@ -2,11 +2,59 @@
2
  language: en
3
  tags:
4
  - tvp
 
 
 
5
  license: other
6
  datasets:
7
  - charades
 
8
  ---
9
 
10
  # TVP base model
11
 
12
  The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
@@ -156,29 +204,6 @@ start, end = round(timestamp[0][0]*duration, 1), round(timestamp[0][1]*duration,
156
  print(f"The time slot of the video corresponding to the text \"{text}\" is from {start}s to {end}s")
157
  ```
158
 
159
- ### Limitations and bias
160
-
161
- TODO
162
-
163
- ## Training data
164
-
165
- The TVP model was pretrained on public datasets:
166
- - [charades](https://prior.allenai.org/projects/charades),
167
-
168
- ## Training procedure
169
-
170
- ### Preprocessing
171
-
172
- TODO
173
-
174
- ### Pretraining
175
-
176
- TODO
177
-
178
- ## Evaluation results
179
-
180
- Please refer to [Table 2](https://arxiv.org/pdf/2303.04995.pdf) for TVP's performance on Temporal Video Grounding task.
181
-
182
  ### BibTeX entry and citation info
183
  ```bibtex
184
  @inproceedings{zhang2023text,
@@ -189,3 +214,6 @@ Please refer to [Table 2](https://arxiv.org/pdf/2303.04995.pdf) for TVP's perfor
189
  year={2023}
190
  }
191
  ```
 
 
 
 
2
  language: en
3
  tags:
4
  - tvp
5
+ - intel
6
+ - cvpr
7
+ - charades
8
  license: other
9
  datasets:
10
  - charades
11
+ library_name: transformers
12
  ---
13
 
14
+ # TVP base model
15
+
16
+ | Model Detail | Description |
17
+ | ----------- | ----------- |
18
+ | Model Authors | Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding |
19
+ | Date | 2023 |
20
+ | Version | Base |
21
+ | Type | Text-Visual Prompting for Temporal Video Grounding |
22
+ | Paper or Other Resources | Paper: [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995); Dataset: [Charades](https://prior.allenai.org/projects/charades) |
23
+ | License | Other |
24
+ | Questions or Comments | [Community Tab](https://huggingface.co/Intel/tvp-base/discussions) and [Intel DevHub Discord](https://discord.gg/rv2Gp55UJQ) |
25
+
26
+ | Intended Use | Description |
27
+ | ----------- | ----------- |
28
+ | Primary intended uses | The TVP model is designed for temporal video grounding (TVG), specifically to predict the start and end times of moments described by a text sentence within a long, untrimmed video. |
29
+ | Primary intended users | Researchers and developers working in the field of computer vision, particularly those focused on video understanding and cross-modal (text and video) tasks. |
30
+ | Out-of-scope uses | The model is not intended for real-time video processing or applications requiring 3D visual features extraction due to its design for efficiency with 2D features. |
31
+
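The usage example in this README converts the model's output into seconds by multiplying each value by the video duration. A minimal sketch of that step is shown here, assuming the model returns a normalized `(start, end)` pair in the range 0 to 1. The helper name `to_seconds`, the clamping, and the swap of reversed endpoints are illustrative assumptions and are not part of the TVP code.

```python
def to_seconds(normalized_span, duration):
    """Convert a normalized (start, end) pair into seconds.

    `normalized_span` is assumed to hold fractions of the video length;
    `duration` is the video length in seconds.
    """
    start_frac, end_frac = normalized_span

    # Assumption: clamp to [0, 1] so the result stays inside the video.
    start_frac = min(max(start_frac, 0.0), 1.0)
    end_frac = min(max(end_frac, 0.0), 1.0)

    # Assumption: reorder the endpoints if they arrive reversed.
    if end_frac < start_frac:
        start_frac, end_frac = end_frac, start_frac

    # Same rounding as the README usage example (one decimal place).
    return round(start_frac * duration, 1), round(end_frac * duration, 1)


start, end = to_seconds((0.25, 0.5), duration=40.0)
print(f"Predicted moment: {start}s to {end}s")
```

Whether clamping is appropriate depends on how out-of-range predictions should be treated; dropping them may be preferable in an evaluation setting.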
32
+
33
+ # Factors
34
+ Relevant factors: The model's performance may vary across different video content, such as variations in video quality, lighting conditions, or genres (e.g., action vs. dialogue-heavy scenes).
35
+ Evaluation factors: Performance has been evaluated on benchmark datasets like Charades-STA and ActivityNet Captions, focusing on metrics relevant to temporal video grounding accuracy.
36
+
37
+ # Metrics
38
+
39
+ Model performance measures: The model is trained with the Temporal-Distance IoU (TDIoU) loss, which the paper designs for efficient learning of TVG. Performance is reported on the benchmark datasets described below.
40
+
41
+ Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.
42
+
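For readers unfamiliar with IoU-based scoring of temporal grounding, the sketch below computes the plain temporal intersection-over-union between a predicted and a ground-truth segment, plus the fraction of predictions that reach an IoU threshold. This is not the TDIoU loss from the paper, and whether it matches the paper's exact evaluation protocol is an assumption. The function names `temporal_iou` and `recall_at_iou` are illustrative.

```python
def temporal_iou(pred, truth):
    """Plain IoU of two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    true_start, true_end = truth

    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if not predictions:
        return 0.0
    hits = sum(
        temporal_iou(pred, truth) >= threshold
        for pred, truth in zip(predictions, ground_truths)
    )
    return hits / len(predictions)


# Hypothetical segments, for illustration only.
preds = [(0.0, 10.0), (0.0, 5.0)]
truths = [(0.0, 10.0), (5.0, 10.0)]
print(recall_at_iou(preds, truths, threshold=0.5))
```

Higher thresholds (for example 0.7) demand tighter localization, so reporting several thresholds gives a fuller picture than a single number.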
43
+ # Training Data
44
+
45
+ The TVP model was pretrained on public datasets such as Charades.
46
+
47
+ Charades is a dataset composed of 9,848 videos of daily indoor activities collected through Amazon Mechanical Turk. 267 different users were each presented with a sentence that includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence (as in a game of Charades). The dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. This work was presented at ECCV 2016.
48
+
49
+ Each video has been exhaustively annotated using consensus from 4 workers on the training set, and from 8 workers on the test set. Please refer to the updated accompanying publication for details. Please contact vision.amt@allenai.org for questions about the dataset.
50
+
51
+ # Quantitative Analyses
52
+
53
+ Unitary results: Refer to Table 2 in the provided paper for TVP's performance on the Temporal Video Grounding task.
54
+
55
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e1cfa7f9927d9455acdc72/WOeve3VDZU2WvoXfvoK5X.png)
56
+
57
+
58
  # TVP base model
59
 
60
  The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
 
204
  print(f"The time slot of the video corresponding to the text \"{text}\" is from {start}s to {end}s")
205
  ```
206
 
207
  ### BibTeX entry and citation info
208
  ```bibtex
209
  @inproceedings{zhang2023text,
 
214
  year={2023}
215
  }
216
  ```
217
+
218
+ Disclaimer
219
+ The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.