---
license: mit
datasets:
- imirandam/TROHN-Text
---

# Model Card for CLIP_TROHN-Text

## Model Description

- **Homepage:** https://imirandam.github.io/BiVLC_project_page/
- **Repository:** https://github.com/IMirandaM/BiVLC
- **Paper:** https://arxiv.org/abs/2406.09952
- **Point of Contact:** [Imanol Miranda](mailto:imanol.miranda@ehu.eus)

### Model Summary

CLIP_TROHN-Text is a model presented in the [BiVLC](https://github.com/IMirandaM/BiVLC) paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model pre-trained by `openai`. The goal of this fine-tuning is to improve the model's compositional understanding by adding negative captions that differ from the original captions through small compositional changes. Hyperparameters:

* Learning rate: 1e-6.
* Scheduler: Cosine scheduler with 50 warmup steps.
* Optimizer: AdamW with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
* Loss function: InfoNCE loss, modified to add only negative captions, following the idea proposed in NegCLIP.
* Batch size: 200. Hard negative captions are then added to each batch; because there are no hard negative images, each batch contains 200 images x 400 captions (positives + hard negatives).
* Epochs: All models were fine-tuned for 10 epochs, with validation accuracy as the model selection criterion, i.e. the model with the highest accuracy on the corresponding validation set was selected.
* Data: Fine-tuned on the [TROHN-Text](https://huggingface.co/datasets/imirandam/TROHN-Text) dataset.
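
The scheduler and loss entries above can be sketched in plain Python. This is a minimal, dependency-free illustration and not the training code used for this model: the function names `cosine_lr` and `neg_caption_info_nce`, the linear warmup shape, the `logit_scale` value, and the choice to average the image-to-text and text-to-image terms are all assumptions made for the example. Embeddings are assumed to be L2-normalized.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-6, warmup_steps=50):
    """Linear warmup to base_lr, then cosine decay towards zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _neg_log_softmax(row, target):
    """Cross-entropy of one row of logits against the index `target`."""
    m = max(row)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return log_z - row[target]


def neg_caption_info_nce(image_emb, pos_text_emb, neg_text_emb, logit_scale=100.0):
    """InfoNCE loss where only captions receive hard negatives.

    image_emb, pos_text_emb, neg_text_emb: lists of n embedding vectors each.
    Image i is paired with positive caption i; the n negative captions are
    appended as extra columns, so image-to-text logits have shape n x 2n.
    """
    n = len(image_emb)
    captions = pos_text_emb + neg_text_emb  # 2n captions in total

    # Image -> text: each image is scored against positives and hard negatives.
    i2t = [[logit_scale * _dot(img, cap) for cap in captions] for img in image_emb]
    # Text -> image: only positive captions act as queries (no negative images).
    t2i = [[logit_scale * _dot(txt, img) for img in image_emb] for txt in pos_text_emb]

    loss_i2t = sum(_neg_log_softmax(i2t[i], i) for i in range(n)) / n
    loss_t2i = sum(_neg_log_softmax(t2i[i], i) for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

With 200 images per batch, the image-to-text logit matrix in this sketch is 200 x 400, matching the batch shape described above. In a PyTorch training loop, the listed optimizer settings would correspond to `torch.optim.AdamW(params, lr=1e-6, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.1)`.
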

### Evaluation Data

The model is evaluated on [BiVLC](https://huggingface.co/datasets/imirandam/BiVLC).

### Licensing Information

This work is licensed under the MIT License.

## Citation Information

If you find this model useful, please consider citing our paper:

```
@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval},
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```