Model Card for BioTrove-CLIP
BioTrove-CLIP is a new suite of vision-language foundation models for biodiversity. These CLIP-style models were trained on BioTrove-Train, a large-scale dataset of 40 million images covering 33K species of plants and animals. The models are evaluated on zero-shot image classification tasks.
- Model type: Vision Transformer (ViT-B/16, ViT-L/14)
- License: MIT
- Fine-tuned from model: OpenAI CLIP, MetaCLIP, BioCLIP
These models were developed for the benefit of the AI community as an open-source product. We therefore request that any derivative products also be open-source.
Model Description
BioTrove-CLIP is based on OpenAI's CLIP model. The models were trained on BioTrove-Train in the following configurations:
- BioTrove-CLIP-O: A ViT-B/16 backbone initialized from the OpenAI CLIP checkpoint and trained for 40 epochs.
- BioTrove-CLIP-B: A ViT-B/16 backbone initialized from the BioCLIP checkpoint and trained for 8 epochs.
- BioTrove-CLIP-M: A ViT-L/14 backbone initialized from the MetaCLIP checkpoint and trained for 12 epochs.
To access the checkpoints of the above models, go to the Files and versions tab and download the weights. These weights can be used directly for zero-shot classification and fine-tuning. The filenames correspond to the specific model weights:
- BioTrove-CLIP-O: biotroveclip-vit-b-16-from-openai-epoch-40.pt
- BioTrove-CLIP-B: biotroveclip-vit-b-16-from-bioclip-epoch-8.pt
- BioTrove-CLIP-M: biotroveclip-vit-l-14-from-metaclip-epoch-12.pt
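The checkpoint filenames follow a regular pattern: architecture, initialization source, and final epoch. As a minimal sketch, the helper below parses that pattern with the Python standard library. The name `parse_checkpoint_name` and the returned field names are illustrative assumptions and are not part of any BioTrove tooling.

```python
import re

# Filenames as listed above, keyed by model variant.
CHECKPOINTS = {
    "BioTrove-CLIP-O": "biotroveclip-vit-b-16-from-openai-epoch-40.pt",
    "BioTrove-CLIP-B": "biotroveclip-vit-b-16-from-bioclip-epoch-8.pt",
    "BioTrove-CLIP-M": "biotroveclip-vit-l-14-from-metaclip-epoch-12.pt",
}

# Hypothetical pattern inferred from the three filenames above.
_PATTERN = re.compile(
    r"^biotroveclip-(?P<arch>vit-[a-z]-\d+)"
    r"-from-(?P<init>[a-z]+)-epoch-(?P<epoch>\d+)\.pt$"
)


def parse_checkpoint_name(filename):
    """Split a checkpoint filename into architecture, init source, and epoch."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    # "vit-b-16" -> "ViT-B-16", the spelling used by the --model flag.
    arch = match["arch"].upper().replace("VIT", "ViT")
    return {"arch": arch, "init": match["init"], "epoch": int(match["epoch"])}


info = parse_checkpoint_name(CHECKPOINTS["BioTrove-CLIP-M"])
```

Keeping the architecture string alongside the file can help avoid loading ViT-L/14 weights into a ViT-B/16 model definition by mistake.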
Model Training
See the Model Training section on GitHub for examples of how to use BioTrove-CLIP models in zero-shot image classification tasks.
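For orientation, CLIP-style zero-shot classification scores one image embedding against one text embedding per class name and picks the closest match. The sketch below shows only that scoring step, on toy vectors and with the standard library. The embeddings, the class names, and the logit scale of 100 are illustrative assumptions; real embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between an image and each class text."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-d embeddings standing in for encoder outputs (hypothetical values).
class_names = ["Danaus plexippus", "Quercus alba"]
text_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
prediction = class_names[probs.index(max(probs))]
```

Because no classifier head is trained, changing the label set only requires encoding a new list of class names.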
We train three models using a modified version of the BioCLIP / OpenCLIP codebase. Each model is trained on BioTrove-Train (40M), previously released as Arboretum-40M, on 2 nodes, 8xH100 GPUs, on NYU's Greene high-performance compute cluster. We publicly release all code needed to reproduce our results on the GitHub page.
We tune our hyperparameters with Ray prior to training. Our standard training parameters are as follows; the --pretrained, --model, and --epochs values shown correspond to the BioTrove-CLIP-O configuration:
--dataset-type webdataset
--pretrained openai
--text_type random
--dataset-resampled
--warmup 5000
--batch-size 4096
--accum-freq 1
--epochs 40
--workers 8
--model ViT-B-16
--lr 0.0005
--wd 0.0004
--precision bf16
--beta1 0.98
--beta2 0.99
--eps 1.0e-6
--local-loss
--gather-with-grad
--ddp-static-graph
--grad-checkpointing
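When launching several configurations, it can help to keep these flags in a structured form. The following is a minimal sketch that renders the parameters listed above into an argument list. The `to_argv` helper and its `overrides` argument are hypothetical, and the training entry point is deliberately left out because it depends on the codebase in use.

```python
# Flags that take a value, copied from the list above.
VALUE_FLAGS = {
    "--dataset-type": "webdataset",
    "--pretrained": "openai",
    "--text_type": "random",
    "--warmup": 5000,
    "--batch-size": 4096,
    "--accum-freq": 1,
    "--epochs": 40,
    "--workers": 8,
    "--model": "ViT-B-16",
    "--lr": 0.0005,
    "--wd": 0.0004,
    "--precision": "bf16",
    "--beta1": 0.98,
    "--beta2": 0.99,
    "--eps": 1.0e-6,
}

# Boolean switches that take no value.
SWITCHES = [
    "--dataset-resampled",
    "--local-loss",
    "--gather-with-grad",
    "--ddp-static-graph",
    "--grad-checkpointing",
]


def to_argv(overrides=None):
    """Render the flags as a flat argument list, applying optional overrides."""
    merged = {**VALUE_FLAGS, **(overrides or {})}
    argv = []
    for flag, value in merged.items():
        argv.extend([flag, str(value)])
    argv.extend(SWITCHES)
    return argv


argv = to_argv()
shorter_run = to_argv({"--epochs": 8})  # illustrative override only
```

The resulting list can be appended to whichever launcher command the training codebase expects.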
For more extensive documentation of the training process and the significance of each hyperparameter, we recommend consulting the OpenCLIP and BioCLIP documentation.
Model Validation
To validate the zero-shot accuracy of our trained models and compare against other benchmarks, we use the VLHub repository with slight modifications.
Pre-Run
After cloning the GitHub repository and navigating to the BioTrove/model_validation directory, we recommend installing all the project requirements into a conda environment:
pip install -r requirements.txt
Also, before executing a command in VLHub, please add BioTrove/model_validation/src to your PYTHONPATH:
export PYTHONPATH="$PYTHONPATH:$PWD/src";
Base Command
A basic BioTrove-CLIP model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights reside at the path given by the --resume flag, on the ImageNet validation set and reports the results to Weights and Biases.
python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --model="resnet50" --zeroshot-frequency=1 --image-size=224 --resume "/PATH/TO/WEIGHTS.pth" --report-to wandb
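To evaluate one of the BioTrove-CLIP checkpoints rather than the ResNet50 example, the same flags would presumably be reused with different --model and --resume values. The sketch below assembles such a command as a string. It assumes VLHub accepts the `ViT-B-16` model name used in the training parameters, which this card does not confirm. The `build_eval_command` helper is hypothetical, and the checkpoint and ImageNet paths are placeholders.

```python
import shlex


def build_eval_command(model, weights_path, imagenet_val, batch_size=32, workers=8):
    """Assemble the evaluation command shown above for a given model and checkpoint."""
    args = [
        "python", "src/training/main.py",
        f"--batch-size={batch_size}",
        f"--workers={workers}",
        "--imagenet-val", imagenet_val,
        f"--model={model}",
        "--zeroshot-frequency=1",
        "--image-size=224",
        "--resume", weights_path,
        "--report-to", "wandb",
    ]
    return shlex.join(args)  # quotes any argument that needs shell escaping


command = build_eval_command(
    model="ViT-B-16",  # assumption: same model name as in the training flags
    weights_path="/PATH/TO/biotroveclip-vit-b-16-from-openai-epoch-40.pt",
    imagenet_val="/imagenet/val/",
)
print(command)
```

Building the command from a list rather than by string concatenation keeps paths containing spaces intact.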
Training Links
- Main Dataset Repository: BioTrove
- Dataset Paper: BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity (arXiv)
- HF Dataset card: BioTrove-Train (40M)
Model Limitations
All the BioTrove-CLIP models were evaluated on the challenging CONFOUNDING-SPECIES benchmark, and all of them performed at or below random chance. Improving on this is an interesting avenue for follow-up work that could further expand the models' capabilities.
In general, we found that models trained on web-scraped data performed better with common names, whereas models trained on specialist datasets performed better when using scientific names. Additionally, models trained on web-scraped data excel at classifying at the highest taxonomic level (kingdom), while models begin to benefit from specialist datasets like BioTrove-Train (40M) and Tree-of-Life-10M at the lower taxonomic levels (order and species). From a practical standpoint, BioTrove-CLIP is highly accurate at the species level, and higher-level taxa can be deterministically derived from lower ones.
Addressing these limitations will further enhance the applicability of models like BioTrove-CLIP in real-world biodiversity monitoring tasks.
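The point that higher-level taxa can be derived deterministically from a species-level prediction can be sketched as a simple table lookup. The two-entry table and the `roll_up` function below are illustrative assumptions; a real table would cover every species label the model can predict.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Tiny illustrative lookup table: species -> taxa from kingdom down to genus.
TAXONOMY = {
    "Danaus plexippus": [
        "Animalia", "Arthropoda", "Insecta", "Lepidoptera", "Nymphalidae", "Danaus",
    ],
    "Quercus alba": [
        "Plantae", "Tracheophyta", "Magnoliopsida", "Fagales", "Fagaceae", "Quercus",
    ],
}


def roll_up(species, rank):
    """Return the taxon at `rank` for a predicted species name."""
    if rank == "species":
        return species
    return TAXONOMY[species][RANKS.index(rank)]


predicted_species = "Danaus plexippus"  # stand-in for a model prediction
predicted_order = roll_up(predicted_species, "order")
```

With this approach, a single species-level classifier can answer queries at any coarser rank without separate models per rank.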
Acknowledgements
This work was supported by the AI Research Institutes program, funded by the NSF and USDA-NIFA, under the AI Institute for Resilient Agriculture, Award No. 2021-67021-35329. It was also partly supported by the NSF under CPS Frontier grant CNS-1954556. We also gratefully acknowledge the support of NYU IT High Performance Computing resources, services, and staff expertise.
Citation
If you find the models and datasets useful in your research, please consider citing our paper:
@misc{yang2024arboretumlargemultimodaldataset,
title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity},
author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and
Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and
Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
year={2024},
eprint={2406.17720},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.17720},
}
For more details and access to the Arboretum dataset, please visit the Project Page.