Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination
ICLR 2025
Pritam Sarkar
Sayna Ebrahimi
Ali Etemad
Ahmad Beirami
Sercan O Arik
Tomas Pfister
[arXiv] [OpenReview] [GitHub] [Model Weights 🤗] [Training Data]
Please see our GitHub repo for details.
Setup environment
conda create -n halva python=3.10 -y
conda activate halva
pip install --upgrade pip
pip install -r req.txt
module load cuda/11.7.1
pip install flash-attn --no-build-isolation
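As an optional sanity check (not part of the original instructions), the following commands verify that PyTorch detects the GPU and that flash-attn imports cleanly:
##### optional environment sanity check
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn:', flash_attn.__version__)"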
Try HALVA!
We share a minimal setup to quickly try HALVA! See this notebook.
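Alternatively, here is a minimal command-line sketch, assuming you have the LLaVA 1.5 code installed and a local copy of the HALVA LoRA weights; the checkpoint path and image file below are placeholders, not the actual released paths.
##### quick command-line test (sketch; paths are placeholders)
python -m llava.serve.cli \
    --model-path ./checkpoints/halva7b-lora \
    --model-base liuhaotian/llava-v1.5-7b \
    --image-file ./examples/sample.jpg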
Model weights
Training HALVA
Data
Generative data-augmented contrastive samples
- Vision-language instructions and their correct and hallucinated responses are available here: data
- Download the images from Visual Genome and save part 1 as data/vg/VG_100K and part 2 as data/vg/VG_100K_2 (a download sketch follows below).
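A download sketch; the mirror URLs below are the commonly used Stanford links, so verify them (and the resulting folder names) on the official Visual Genome page before use.
##### download Visual Genome images (sketch; verify URLs and folder names)
mkdir -p data/vg
wget -c https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip
wget -c https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip
unzip -q images.zip -d data/vg/    # part 1, expected to land in data/vg/VG_100K
unzip -q images2.zip -d data/vg/   # part 2, expected to land in data/vg/VG_100K_2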
Reference samples
- A random subset from llava_v1_5_mix665k.json. For reproducibility, we share the exact subset used in our study: ref data
- Image sources (a layout sketch follows after this list):
  - MSCOCO - download them as data/MSCOCO2017
  - TextVQA - download them as data/textvqa
  - GQA - download them as data/gqa
  - OCR-VQA - download them as data/ocr_vqa
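A rough sketch of the expected directory layout; download each dataset from its official site and adjust the nesting if your copy differs.
##### expected layout for the reference-sample image sources
mkdir -p data/MSCOCO2017 data/textvqa data/gqa data/ocr_vqa
# MSCOCO 2017: images from https://cocodataset.org -> data/MSCOCO2017
# TextVQA:     images from the official TextVQA site -> data/textvqa
# GQA:         images from the official GQA site -> data/gqa
# OCR-VQA:     images via the official OCR-VQA download scripts -> data/ocr_vqa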
Train
- The base model LLaVA-v1.5 weights can be found here: 7B and 13B.
- We use 4 A100 80GB GPUs for training, which takes about 1.5 hours for the 7B variant and 3 hours for the 13B variant. If you are using different GPUs, please make sure to match our default batch_size x gradient accumulation steps for optimal performance with the default hyperparameters (see the sketch after this list).
- The following training scripts can be used to train HALVA with LLaVA 1.5 as the base model:
  - HALVA-7B: src/hallava_7b.sh
  - HALVA-13B: src/hallava_13b.sh
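One way to read the batch-size note above: the effective batch size is per-device batch size x gradient accumulation steps (x number of GPUs, if you also change the GPU count). A sketch of the arithmetic with placeholder values; the actual flags and defaults live inside the training scripts.
##### effective batch size check (placeholder values, not the actual defaults)
NUM_GPUS=4
PER_DEVICE_BATCH_SIZE=16
GRAD_ACCUM_STEPS=1
echo "Effective batch size: $((NUM_GPUS * PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS))"
# e.g. on 2 GPUs, double GRAD_ACCUM_STEPS (or PER_DEVICE_BATCH_SIZE) to keep this constant.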
Evaluation on hallucination benchmarks
Choose the HALVA variant and its base model. We provide sample validation scripts for evaluation; please make sure to update the paths based on your setup.
MODEL="halva13b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-13b"
# OR
MODEL="halva7b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-7b"
CHAIR
- Download the validation images from MSCOCO2014 and store them as data/MSCOCO2014/val2014 (a download sketch follows below). We use the same 500 images for validation as used in prior work.
- You can use the given sample script for evaluation.
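A download sketch, assuming the standard COCO download server; verify the final folder name matches data/MSCOCO2014/val2014.
##### download MSCOCO 2014 validation images (sketch)
mkdir -p data/MSCOCO2014
wget -c http://images.cocodataset.org/zips/val2014.zip
unzip -q val2014.zip -d data/MSCOCO2014/   # should create data/MSCOCO2014/val2014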
##### run chair
bash src/evaluate_hall/chair.sh ${MODEL} ${MODEL_BASE}
MME-Hall
- MME-Hall is a subset of MME consisting of existence, count, position, and color.
- You can follow the official instructions for MME evaluation (link) and download the MME benchmark.
- Once the data is downloaded, you can use the given sample script for evaluation.
##### run mme
bash src/evaluate_hall/mme.sh ${MODEL} ${MODEL_BASE}
AMBER
- Download the validation images from the source repo AMBER and keep them as data/amber/image/.
- Download the annotation data directory and save it as eval_hall/amber/data.
- Once the data is downloaded, you can use the given sample script for evaluation.
##### run amber evaluation on 4 GPUs in parallel if available, else run sequentially by removing & from the end
bash src/evaluate_hall/amber.sh g ${MODEL} ${MODEL_BASE} 0 &
bash src/evaluate_hall/amber.sh da ${MODEL} ${MODEL_BASE} 1 &
bash src/evaluate_hall/amber.sh dr ${MODEL} ${MODEL_BASE} 2 &
bash src/evaluate_hall/amber.sh de ${MODEL} ${MODEL_BASE} 3 &
wait
# get amber f1 for all discriminative tasks
bash src/evaluate_hall/amber_f1.sh ${MODEL}
MMHal-Bench
- The validation data will be directly downloaded from HuggingFace. You can use the given sample script for evaluation.
##### run mmhal-bench
bash src/evaluate_hall/mmhal.sh ${MODEL} ${MODEL_BASE} 0
HallusionBench
- Download the validation images from link and save them in data/hallusion_bench.
- Download the annotation files from link and save them in eval_hall/hallusion_bench.
- For more details, you can check the official repo. You can use the given sample script for evaluation.
##### run hallusion-bench
bash src/evaluate_hall/hallusionbench.sh ${MODEL} ${MODEL_BASE} 0
Evaluation on general vision-language tasks
In addition to the evaluation on hallucination benchmarks above, we also evaluate on general vision-language benchmarks. For those, we directly follow the evaluation instructions in the LLaVA repo.
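For reference only, a hedged sketch of how such an evaluation might be launched using LLaVA's standard evaluation scripts; the script name below comes from the LLaVA repo (see its docs/Evaluation.md), not from this one, and you would need to edit the model path/base inside it to point at a HALVA checkpoint.
##### example only: running one of LLaVA's standard evaluation scripts
bash scripts/v1_5/eval/gqa.sh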
VILA
The instructions above mainly cover the LLaVA 1.5-based checkpoints; the VILA code can be found inside the *_vila directories.
Citation
If you find this repository useful, please consider giving a star :star: and citing it using the given BibTeX entry:
@misc{sarkar2024halva,
title={Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination},
author={Pritam Sarkar and Sayna Ebrahimi and Ali Etemad and Ahmad Beirami and Sercan Ö. Arık and Tomas Pfister},
year={2024},
eprint={2405.18654},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgement