metadata
license: mit
datasets:
- imirandam/TROHN-Text
Model Card for CLIP_TROHN-Text
Model Description
- Homepage: https://imirandam.github.io/BiVLC_project_page/
- Repository: https://github.com/IMirandaM/BiVLC
- Paper: https://arxiv.org/abs/2406.09952
- Point of Contact: Imanol Miranda
Model Summary
CLIP_TROHN-Text is a model presented in the BiVLC paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model pre-trained by 'openai'. The goal of this fine-tuning is to improve the model's compositional understanding by adding negative captions. Each negative caption introduces a small compositional change.
Hyperparameters:
- Learning rate: 1e-6.
- Scheduler: Cosine scheduler with 50 warmup steps.
- Optimizer: AdamW optimizer with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
- Loss function: InfoNCE Loss. The loss is modified to add negative captions only, following the idea proposed in NegCLIP.
- Batch size: We use a batch size of 200 and then add the negatives. Because there are no hard negative images, each batch contains 200 images x 400 captions (200 positives + 200 hard negatives).
- Epochs: We fine-tuned all models for 10 epochs and used validation accuracy as the model selection criterion, i.e., we selected the model with the highest accuracy on the corresponding validation set.
- Data: The model is fine-tuned on the TROHN-Text dataset.
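The loss and batch layout described in the list above can be illustrated with a minimal NumPy sketch. This is not the authors' OpenCLIP training code. The function name `negclip_style_loss`, the `logit_scale` value of 100, and the embedding dimension of 512 are assumptions made for illustration. The sketch assumes the usual NegCLIP-style formulation: images are contrasted against all captions (positives and hard negatives), while only positive captions are contrasted against images, since negative captions have no matching image.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def negclip_style_loss(image_emb, pos_text_emb, neg_text_emb, logit_scale=100.0):
    """InfoNCE loss with extra hard negative captions (illustrative only).

    All inputs are (B, D) arrays of L2-normalised embeddings. Row i of
    pos_text_emb is the caption of image i, and row i of neg_text_emb is
    its hard negative caption. logit_scale is an assumed temperature.
    """
    batch = image_emb.shape[0]
    all_text = np.concatenate([pos_text_emb, neg_text_emb], axis=0)  # (2B, D)

    # Image-to-text: every image is scored against all 2B captions.
    logits_i2t = logit_scale * (image_emb @ all_text.T)  # (B, 2B)
    # Text-to-image: only positive captions have a matching image.
    logits_t2i = logit_scale * (pos_text_emb @ image_emb.T)  # (B, B)

    idx = np.arange(batch)  # the correct match for row i is column i
    loss_i2t = -log_softmax(logits_i2t)[idx, idx].mean()
    loss_t2i = -log_softmax(logits_t2i)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def l2_normalise(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Random stand-ins for encoder outputs: 200 images, 200 positive captions
# and 200 hard negative captions, i.e. a 200 x 400 image-to-text matrix.
rng = np.random.default_rng(0)
images = l2_normalise(rng.normal(size=(200, 512)))
positives = l2_normalise(rng.normal(size=(200, 512)))
negatives = l2_normalise(rng.normal(size=(200, 512)))
loss = negclip_style_loss(images, positives, negatives)
print(f"loss on random embeddings: {loss:.4f}")
```

In this formulation, a hard negative caption that the model scores as highly as the true caption directly increases the image-to-text term, which is what pushes the text encoder to separate captions that differ only by a small compositional change.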
Evaluation Data
The model is evaluated on BiVLC.
Licensing Information
This work is licensed under the MIT License.
Citation Information
If you find this model useful, please consider citing our paper:
@misc{miranda2024bivlc,
title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval},
author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
year={2024},
eprint={2406.09952},
archivePrefix={arXiv},
primaryClass={cs.CV}
}