Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination
ICLR 2025
Pritam Sarkar
Sayna Ebrahimi
Ali Etemad
Ahmad Beirami
Sercan O Arik
Tomas Pfister
[arXiv] [OpenReview] [GitHub] [Model Weights 🤗] [Training Data]
Please see our GitHub repo for details.
Setup environment
conda create -n halva python=3.10 -y
conda activate halva
pip install --upgrade pip
pip install -r req.txt
module load cuda/11.7.1
pip install flash-attn --no-build-isolation
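As an optional sanity check (not part of the original instructions), the following commands verify that PyTorch detects the GPU and that flash-attn imports cleanly:
##### optional environment sanity check
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn:', flash_attn.__version__)"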
Try HALVA!
We share a minimal setup to quickly try HALVA! See this notebook.
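Alternatively, here is a minimal command-line sketch, assuming you have the LLaVA 1.5 code installed and a local copy of the HALVA LoRA weights; the checkpoint path and image file below are placeholders, not the actual released paths.
##### quick command-line test (sketch; paths are placeholders)
python -m llava.serve.cli \
    --model-path ./checkpoints/halva7b-lora \
    --model-base liuhaotian/llava-v1.5-7b \
    --image-file ./examples/sample.jpg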
Model weights
Training HALVA
Data
Generative data-augmented contrastive samples
- Vision-language instructions and their correct and hallucinated responses are available here: data
- Download the images from Visual Genome and save part 1 as data/vg/VG_100K and part 2 as data/vg/VG_100K_2 (a download sketch follows below).
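A download sketch; the mirror URLs below are the commonly used Stanford links, so verify them (and the resulting folder names) on the official Visual Genome page before use.
##### download Visual Genome images (sketch; verify URLs and folder names)
mkdir -p data/vg
wget -c https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip
wget -c https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip
unzip -q images.zip -d data/vg/    # part 1, expected to land in data/vg/VG_100K
unzip -q images2.zip -d data/vg/   # part 2, expected to land in data/vg/VG_100K_2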
Reference samples
- A random subset from llava_v1_5_mix665k.json. For reproducibility, we share the exact subset used in our study: ref data
- Image sources (a layout sketch follows after this list):
  - MSCOCO - download them as data/MSCOCO2017
  - TextVQA - download them as data/textvqa
  - GQA - download them as data/gqa
  - OCR-VQA - download them as data/ocr_vqa
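A rough sketch of the expected directory layout; download each dataset from its official site and adjust the nesting if your copy differs.
##### expected layout for the reference-sample image sources
mkdir -p data/MSCOCO2017 data/textvqa data/gqa data/ocr_vqa
# MSCOCO 2017: images from https://cocodataset.org -> data/MSCOCO2017
# TextVQA:     images from the official TextVQA site -> data/textvqa
# GQA:         images from the official GQA site -> data/gqa
# OCR-VQA:     images via the official OCR-VQA download scripts -> data/ocr_vqa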
Train
- The base model LLaVA-v1.5 weights can be found here: 7B and 13B.
- We use 4 A100 80GB GPUs for training, which takes about 1.5 hours for the 7B variant and 3 hours for the 13B variant. If you are using different GPUs, please make sure to match our default batch_size x gradient accumulation steps for optimal performance with the default hyperparameters (see the sketch after this list).
- The following training scripts can be used to train HALVA with LLaVA 1.5 as the base model:
  - HALVA-7B: src/hallava_7b.sh
  - HALVA-13B: src/hallava_13b.sh
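One way to read the batch-size note above: the effective batch size is per-device batch size x gradient accumulation steps (x number of GPUs, if you also change the GPU count). A sketch of the arithmetic with placeholder values; the actual flags and defaults live inside the training scripts.
##### effective batch size check (placeholder values, not the actual defaults)
NUM_GPUS=4
PER_DEVICE_BATCH_SIZE=16
GRAD_ACCUM_STEPS=1
echo "Effective batch size: $((NUM_GPUS * PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS))"
# e.g. on 2 GPUs, double GRAD_ACCUM_STEPS (or PER_DEVICE_BATCH_SIZE) to keep this constant.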
Evaluation on hallucination benchmarks
Choose the HALVA variant and its base model. We provide sample validation scripts for evaluation; please make sure to update the paths based on your setup.
MODEL="halva13b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-13b"
# OR
MODEL="halva7b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-7b"
CHAIR
- Download the validation images from MSCOCO2014 and store them as data/MSCOCO2014/val2014 (a download sketch follows below). We use the same 500 images for validation as used in prior work.
- You can use the given sample script for evaluation.
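A download sketch, assuming the standard COCO download server; verify the final folder name matches data/MSCOCO2014/val2014.
##### download MSCOCO 2014 validation images (sketch)
mkdir -p data/MSCOCO2014
wget -c http://images.cocodataset.org/zips/val2014.zip
unzip -q val2014.zip -d data/MSCOCO2014/   # should create data/MSCOCO2014/val2014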
##### run chair
bash src/evaluate_hall/chair.sh ${MODEL} ${MODEL_BASE}
MME-Hall
- MME-Hall is a subset of MME consisting of existence, count, position, and color.
- You can follow the official instructions for MME evaluation (link) and download the MME benchmark.
- Once the data is downloaded, you can use the given sample script for evaluation.
##### run mme
bash src/evaluate_hall/mme.sh ${MODEL} ${MODEL_BASE}
AMBER
- Download the validation images from the source repo AMBER and keep them as data/amber/image/.
- Download the annotation data directory and save it as eval_hall/amber/data.
- Once the data is downloaded, you can use the given sample script for evaluation.
##### run amber evaluation on 4 GPUs in parallel if available, else run sequentially by removing & from the end
bash src/evaluate_hall/amber.sh g ${MODEL} ${MODEL_BASE} 0 &
bash src/evaluate_hall/amber.sh da ${MODEL} ${MODEL_BASE} 1 &
bash src/evaluate_hall/amber.sh dr ${MODEL} ${MODEL_BASE} 2 &
bash src/evaluate_hall/amber.sh de ${MODEL} ${MODEL_BASE} 3 &
wait
# get amber f1 for all discriminative tasks
bash src/evaluate_hall/amber_f1.sh ${MODEL}
MMHal-Bench
- The validation data will be directly downloaded from HuggingFace. You can use the given sample script for evaluation.
##### run mmhal-bench
bash src/evaluate_hall/mmhal.sh ${MODEL} ${MODEL_BASE} 0
HallusionBench
- Download the validation images from link and save them in data/hallusion_bench.
- Download the annotation files from link and save them in eval_hall/hallusion_bench.
- For more details, you can check the official repo. You can use the given sample script for evaluation.
##### run hallusion-bench
bash src/evaluate_hall/hallusionbench.sh ${MODEL} ${MODEL_BASE} 0
Evaluation on general vision-language tasks
In addition to the evaluation on hallucination benchmarks above, we also evaluate on general vision-language benchmarks. For those, we directly follow the evaluation instructions in the LLaVA repo.
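For reference only, a hedged sketch of how such an evaluation might be launched using LLaVA's standard evaluation scripts; the script name below comes from the LLaVA repo (see its docs/Evaluation.md), not from this one, and you would need to edit the model path/base inside it to point at a HALVA checkpoint.
##### example only: running one of LLaVA's standard evaluation scripts
bash scripts/v1_5/eval/gqa.sh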
VILA
The instructions above mainly cover the LLaVA 1.5-based checkpoints; the VILA code can be found inside the *_vila directories.
Citation
If you find this repository useful, please consider giving a star :star: and citing it using the given BibTeX entry:
@misc{sarkar2024halva,
title={Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination},
author={Pritam Sarkar and Sayna Ebrahimi and Ali Etemad and Ahmad Beirami and Sercan Ö. Arık and Tomas Pfister},
year={2024},
eprint={2405.18654},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgement