arxiv:1906.02467

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

Published on Jun 6, 2019

Abstract

Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, existing VideoQA datasets are limited in scale and their QA pairs are typically generated automatically. These limitations restrict their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos. The dataset is available at https://github.com/MILVLG/activitynet-qa
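
For readers who want to work with the released annotations, the sketch below shows one way to load and join the QA pairs in Python. It assumes the repository distributes questions and answers as parallel JSON lists that share a question id; the file paths (`dataset/train_q.json`, `dataset/train_a.json`) and field names (`question_id`, `video_name`, `question`, `answer`) are illustrative assumptions about the layout, not a confirmed schema, so check the repository's README before relying on them.

```python
import json
from collections import defaultdict

# Assumed file layout: question and answer annotations released as parallel
# JSON lists that share a question id (paths and field names are illustrative
# assumptions, not a confirmed schema).
QUESTION_FILE = "dataset/train_q.json"
ANSWER_FILE = "dataset/train_a.json"


def load_qa_pairs(question_path, answer_path):
    """Join question and answer records on their shared question id."""
    with open(question_path) as f:
        questions = json.load(f)
    with open(answer_path) as f:
        answers = json.load(f)

    answers_by_id = {a["question_id"]: a for a in answers}
    pairs = []
    for q in questions:
        a = answers_by_id.get(q["question_id"])
        if a is None:
            continue  # skip questions without a matching answer record
        pairs.append({
            "video_name": q["video_name"],
            "question": q["question"],
            "answer": a["answer"],
        })
    return pairs


if __name__ == "__main__":
    pairs = load_qa_pairs(QUESTION_FILE, ANSWER_FILE)
    print(f"Loaded {len(pairs)} QA pairs")

    # Group QA pairs by video to see how many questions each video carries.
    per_video = defaultdict(list)
    for p in pairs:
        per_video[p["video_name"]].append(p)
    print(f"Covering {len(per_video)} distinct videos")
```

Joining on a shared question id rather than relying on list order keeps the loader robust if the question and answer files are not sorted identically.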
