
Data preparation

Data for training

The image pretraining dataset is from LLaVA.
The image tuning dataset is from LLaVA.
The video pretraining dataset is from Valley.
The video tuning dataset is from Video-ChatGPT.
Download the training annotations from Baidu Disk, Google Disk, or Peking University Disk.

We also provide the processed data as follows.

Datasets	Baidu Disk
Image pretraining	Link
Image tuning	Link
Video pretraining	Link
Video tuning	Link

After downloading all of them, organize the data as follows in DATA_ROOT.

DATA_ROOT
├── llava_image
├── llava_image_tune
├── valley
└── videochatgpt_tune
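
Before launching training, it can help to confirm that the four folders are where the tree above says they should be. The following is a minimal sketch, not part of the repository: `check_layout` is a hypothetical helper, and the demo runs against a throwaway temporary directory so it is safe to execute as-is.

```shell
# check_layout ROOT NAME... (hypothetical helper, not part of the repository)
# Prints one status line per expected subfolder of ROOT and
# returns non-zero if any of them is missing.
check_layout() {
  root="$1"
  shift
  missing=0
  for d in "$@"; do
    if [ -d "$root/$d" ]; then
      echo "ok       $d"
    else
      echo "MISSING  $d"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Demo on a temporary directory that mimics the expected layout.
demo_root="$(mktemp -d)"
for d in llava_image llava_image_tune valley videochatgpt_tune; do
  mkdir -p "$demo_root/$d"
done

check_layout "$demo_root" llava_image llava_image_tune valley videochatgpt_tune
```

For a real check, pass your own DATA_ROOT instead of `$demo_root`, for example `check_layout "$DATA_ROOT" llava_image llava_image_tune valley videochatgpt_tune`.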

Data for validation

For images, follow LLaVA's instructions. You MUST first download eval.zip, which contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Extract it to eval. This also provides the general directory structure for all datasets.

For videos, the videos and annotations can be downloaded from Video-ChatGPT. We also provide the processed data as follows.

Datasets	Baidu Disk	Google Disk	Peking University Disk
Activitynet_Zero_Shot_QA	Link	-	-
MSRVTT_Zero_Shot_QA	Link	Link	-
MSVD_Zero_Shot_QA	Link	Link	Link
TGIF_Zero_Shot_QA	Link	Link	Link

After downloading all of them, organize the data as follows in eval.

eval
├── GPT_Zero_Shot_QA
│   ├── Activitynet_Zero_Shot_QA
│   ├── MSRVTT_Zero_Shot_QA
│   ├── MSVD_Zero_Shot_QA
│   └── TGIF_Zero_Shot_QA
├── gqa
│   ├── answers
│   ├── data
│   └── llava_gqa_testdev_balanced.jsonl
├── llava-bench-in-the-wild
│   ├── answers
│   ├── answers_gpt4.jsonl
│   ├── bard_0718.jsonl
│   ├── bing_chat_0629.jsonl
│   ├── context.jsonl
│   ├── images
│   ├── questions.jsonl
│   ├── README.md
│   └── reviews
├── mmbench
│   ├── answers
│   ├── answers_upload
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_dev_en_20231003.tsv
├── MME
│   ├── answers
│   ├── convert_answer_to_mme.py
│   └── llava_mme.jsonl
├── mm-vet
│   ├── answers
│   ├── bard_set.json
│   ├── convert_answers.py
│   ├── images
│   ├── llava-mm-vet.jsonl
│   ├── mm-vet.json
│   └── results
├── pope
│   ├── answers
│   ├── coco
│   ├── llava_pope_test.jsonl
│   └── val2014
├── scienceqa
│   ├── answers
│   ├── images
│   ├── llava_test_CQM-A.json
│   ├── pid_splits.json
│   └── problems.json
├── seed_bench
│   ├── answers
│   ├── answers_upload
│   ├── extract_video_frames.py
│   └── llava-seed-bench.jsonl
├── textvqa
│   ├── answers
│   ├── llava_textvqa_val_v051_ocr.jsonl
│   ├── TextVQA_0.5.1_val.json
│   └── train_images
├── vizwiz
│   ├── answers
│   ├── answers_upload
│   ├── llava_test.jsonl
│   ├── test
│   ├── test.json
│   ├── train.json
│   └── val.json
└── vqav2
    ├── answers
    ├── answers_upload
    ├── llava_vqav2_mscoco_test2015.jsonl
    ├── llava_vqav2_mscoco_test-dev2015.jsonl
    └── test2015

Training

Set DATA_ROOT to the directory you organized in the data preparation step.

Stage 1 pretraining script: pretrain.sh.
Stage 2 tuning script: finetune.sh.
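
Stage 2 is meant to follow stage 1, so the two scripts can be chained. The following is a minimal sketch under stated assumptions: `SCRIPT_DIR` is a hypothetical variable for wherever pretrain.sh and finetune.sh live in your checkout (this document gives only the file names), and `run_stage` and `DRY_RUN` are illustrative helpers, not part of the repository. Whether the scripts read DATA_ROOT from the environment or need it edited inside the file is something to check in the scripts themselves.

```shell
SCRIPT_DIR="${SCRIPT_DIR:-scripts}"   # hypothetical location of the two scripts
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands

# run_stage SCRIPT : run one training script, or print it in dry-run mode.
run_stage() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: bash $1"
  else
    bash "$1"
  fi
}

# Stage 2 starts only if stage 1 exits successfully.
run_stage "$SCRIPT_DIR/pretrain.sh" && run_stage "$SCRIPT_DIR/finetune.sh"
```

Set `DRY_RUN=0` to actually launch the stages once the printed paths look right.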

Validating

Our image validation code comes from LLaVA and our video validation code comes from Video-ChatGPT. Thanks for their contributions!

You can refer to the official repositories for validation, but we also provide off-the-shelf scripts.

MSRVTT-QA

Run inference to get the results.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msrvtt.sh

GPT-Assistant evaluation.

bash scripts/v1_5/eval/eval_qa_msrvtt.sh

MSVD-QA

Run inference to get the results.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh

GPT-Assistant evaluation.

bash scripts/v1_5/eval/eval_qa_msvd.sh

TGIF-QA

Run inference to get the results.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_tgif.sh

GPT-Assistant evaluation.

bash scripts/v1_5/eval/eval_qa_tgif.sh

ActivityNet-QA

Run inference to get the results.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_activitynet.sh

GPT-Assistant evaluation.

bash scripts/v1_5/eval/eval_qa_activitynet.sh
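
The four video QA benchmarks above share the same two-step pattern: an inference script `run_qa_<name>.sh` followed by a GPT-assisted evaluation script `eval_qa_<name>.sh`. A minimal sketch that runs them in sequence is shown here. The `video_qa` function and the `DRY_RUN` and `GPU` variables are illustrative and not part of the repository; only the script paths come from the commands above. By default the sketch prints the commands instead of executing them.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands
GPU="${GPU:-0}"           # GPU index passed to the inference step

# video_qa NAME : inference, then GPT-assisted evaluation, for one benchmark.
video_qa() {
  infer="scripts/v1_5/eval/run_qa_$1.sh"
  judge="scripts/v1_5/eval/eval_qa_$1.sh"
  if [ "$DRY_RUN" = "1" ]; then
    echo "CUDA_VISIBLE_DEVICES=$GPU bash $infer"
    echo "bash $judge"
  else
    # Evaluate only if inference succeeded.
    CUDA_VISIBLE_DEVICES="$GPU" bash "$infer" && bash "$judge"
  fi
}

# Stop at the first benchmark that fails.
for name in msrvtt msvd tgif activitynet; do
  video_qa "$name" || break
done
```

Run it from the repository root with `DRY_RUN=0` to execute the commands for real.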

VQAv2

Download test2015 and put it under eval/vqav2.
Multi-GPU inference.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_vqav2.sh

Submit the results to the evaluation server: eval/vqav2/answers_upload.

GQA

Download the data following the official instructions and put it under eval/gqa/data.
Multi-GPU inference.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_gqa.sh
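
The multi-GPU commands for VQAv2 and GQA above list eight devices. On a machine with a different number of GPUs, the device list can be built rather than typed by hand. This is a minimal sketch: `gpu_list` and `NUM_GPUS` are hypothetical names, and it assumes the evaluation script adapts to however many devices appear in `CUDA_VISIBLE_DEVICES`, which this document does not state. The sketch only prints the resulting command.

```shell
# gpu_list N : print the comma-separated device indices 0..N-1.
gpu_list() {
  n="$1"
  list=""
  i=0
  while [ "$i" -lt "$n" ]; do
    list="${list:+$list,}$i"   # add a comma only between entries
    i=$((i + 1))
  done
  echo "$list"
}

NUM_GPUS="${NUM_GPUS:-4}"   # hypothetical: number of GPUs to use
echo "CUDA_VISIBLE_DEVICES=$(gpu_list "$NUM_GPUS") bash scripts/v1_5/eval/eval_image_gqa.sh"
```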

VizWiz

Download test.json and extract test.zip to test. Put them under eval/vizwiz.
Single-GPU inference.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_vizwiz.sh

Submit the results to the evaluation server: eval/vizwiz/answers_upload.

ScienceQA

Under eval/scienceqa, download images, pid_splits.json, and problems.json from the data/scienceqa folder of the ScienceQA repo.
Single-GPU inference and evaluation.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_sqa.sh

TextVQA

Download TextVQA_0.5.1_val.json and the images, and extract them to eval/textvqa.
Single-GPU inference and evaluation.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_textvqa.sh

POPE

Download coco from POPE and put it under eval/pope.
Single-GPU inference and evaluation.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_pope.sh

MMBench

Download mmbench_dev_20230712.tsv and put it under eval/mmbench.
Single-GPU inference.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmbench.sh

Submit the results to the evaluation server: eval/mmbench/answers_upload/mmbench_dev_20230712.

LLaVA-Bench-in-the-Wild

Extract the contents of llava-bench-in-the-wild to eval/llava-bench-in-the-wild.
Single-GPU inference and evaluation.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_llavabench.sh

MM-Vet

Extract mm-vet.zip to eval/mm-vet.
Single-GPU inference.

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmvet.sh