metadata

license: mit
datasets:
  - imirandam/TROHN-Text

Model Card for CLIP_TROHN-Text

Model Description

Homepage: https://imirandam.github.io/BiVLC_project_page/
Repository: https://github.com/IMirandaM/BiVLC
Paper: https://arxiv.org/abs/2406.09952
Point of Contact: Imanol Miranda

Model Summary

CLIP_TROHN-Text is a model presented in the BiVLC paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model pre-trained by 'openai'. The goal of this fine-tuning is to improve the model's compositional understanding by adding hard negative captions to training. Each negative caption contains a small compositional change. Hyperparameters:

Learning rate: 1e-6.
Scheduler: Cosine scheduler with 50 warmup steps.
Optimizer: AdamW optimizer with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
Loss function: InfoNCE loss. The loss is modified to add only hard negative captions (no hard negative images), following the idea proposed in NegCLIP.
Batch size: We use a batch size of 200 and then add the negatives. Since there are no hard negative images, each batch consists of 200 images x 400 captions (positives + hard negatives).
Epochs: We fine-tune all models for 10 epochs and use validation accuracy as the model selection criterion, i.e. we select the model with the highest accuracy on the corresponding validation set.
Data: The model is fine-tuned on the TROHN-Text dataset.
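
The schedule and loss described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the training code used for this model, which relies on OpenCLIP. The function names, the `total_steps` argument and the fixed `temperature` value are assumptions made for the example (CLIP models normally learn their logit scale), and the split into an image-to-text term over all captions and a text-to-image term over positive captions only is one common reading of the NegCLIP idea.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-6, warmup_steps=50):
    """Linear warmup followed by cosine decay.

    total_steps is a hypothetical value; it depends on dataset size and epochs.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def _cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def negclip_style_loss(sim, temperature=0.07):
    """InfoNCE over a B x 2B image-caption similarity matrix.

    sim[i][j] is the similarity between image i and caption j. Columns
    0..B-1 hold the positive captions (caption i matches image i); columns
    B..2B-1 hold the hard negative captions, which only act as negatives.
    The temperature is an assumed constant for illustration.
    """
    batch = len(sim)
    assert all(len(row) == 2 * batch for row in sim)
    # Image-to-text: each image is classified against all 2B captions.
    i2t = sum(
        _cross_entropy([s / temperature for s in sim[i]], i)
        for i in range(batch)
    ) / batch
    # Text-to-image: only the B positive captions have a matching image.
    t2i = sum(
        _cross_entropy([sim[i][j] / temperature for i in range(batch)], j)
        for j in range(batch)
    ) / batch
    return 0.5 * (i2t + t2i)
```

With the batch size listed above, `sim` would have 200 rows and 400 columns. A hard negative caption that scores as high as the positive caption raises the image-to-text term, which is what pushes the model to separate the two.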

Evaluation Data

The model is evaluated on BiVLC.
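
As a rough illustration of bidirectional retrieval scoring, the sketch below scores a single instance made of two images and two captions. It assumes a Winoground-style setup in which image 0 pairs with caption 0 and image 1 pairs with caption 1; the function name and the output keys are hypothetical, and the official evaluation code in the repository should be treated as the reference.

```python
def bivlc_instance_scores(sim):
    """Score one instance from a 2 x 2 similarity matrix.

    sim[i][j] is the similarity between image i and caption j, where
    image 0 pairs with caption 0 and image 1 pairs with caption 1.
    """
    # Image-to-text: each image must prefer its own caption.
    i2t = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Text-to-image: each caption must prefer its own image.
    t2i = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Group: both retrieval directions must be correct.
    return {"i2t": i2t, "t2i": t2i, "group": i2t and t2i}
```

Averaging each of the three values over all instances would give dataset-level accuracies for the two retrieval directions and for the combined group criterion.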

Licensing Information

This work is licensed under the MIT License.

Citation Information

If you find this model useful, please consider citing our paper:

@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval}, 
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}