---
license: mit
base_model:
- mistralai/Pixtral-12B-2409
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- lora
datasets:
- Multimodal-Fatima/FGVC_Aircraft_train
- takara-ai/FloodNet_2021-Track_2_Dataset_HF
---

# pixtral_aerial_VQA_adapter

## Model Details

- **Type**: LoRA Adapter
- **Total Parameters**: 6,225,920
- **Memory Usage**: 23.75 MB
- **Precision**: torch.float32
- **Layer Types**:
  - lora_A: 40
  - lora_B: 40
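
The reported memory usage is consistent with storing every adapter parameter as a 32-bit float. A minimal sketch of that arithmetic, assuming 4 bytes per `torch.float32` value and binary megabytes (1 MB = 1,048,576 bytes); the helper name `adapter_size_mb` is illustrative and not part of any library:

```python
# Sanity check of the reported adapter size.
# Assumption: "MB" here means 1024 * 1024 bytes.

BYTES_PER_DTYPE = {"float32": 4, "float16": 2, "bfloat16": 2}


def adapter_size_mb(num_params: int, dtype: str = "float32") -> float:
    """Return the raw tensor storage size in binary megabytes."""
    return num_params * BYTES_PER_DTYPE[dtype] / (1024 * 1024)


TOTAL_PARAMS = 6_225_920  # total parameter count from the model details

size_fp32 = adapter_size_mb(TOTAL_PARAMS)             # 23.75
size_fp16 = adapter_size_mb(TOTAL_PARAMS, "float16")  # 11.875
```

Casting the adapter to a 16-bit type would halve this footprint; whether that is appropriate depends on the precision used for the base model at inference time.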

## Intended Use

- **Primary intended use**: Processing aerial footage of construction sites for structural and construction surveying.
- The adapter can also be applied to other detailed visual question answering (VQA) tasks on aerial footage.

## Training Data

- **Datasets**:
  1. FloodNet Track 2 dataset (`takara-ai/FloodNet_2021-Track_2_Dataset_HF`)
  2. Subset of the FGVC Aircraft dataset (`Multimodal-Fatima/FGVC_Aircraft_train`)
  3. Custom dataset of 10 image-caption pairs created using Pixtral

## Training Procedure

- **Training method**: LoRA (Low-Rank Adaptation)
- **Base model**: Ertugrul/Pixtral-12B-Captioner-Relaxed
- **Training hardware**: Nebius-hosted NVIDIA H100 machine
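
For readers unfamiliar with LoRA, each `lora_A` / `lora_B` pair forms a low-rank update that is added to a frozen base weight. A minimal pure-Python sketch of that idea follows; the dimensions, rank, and `alpha` scaling are toy values chosen for illustration and are not this adapter's actual configuration.

```python
# Toy illustration of a LoRA update: W_eff = W + (alpha / r) * B @ A.
# All shapes and values below are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Merge a low-rank update into the frozen weight w.

    w:      d_out x d_in frozen base weight
    lora_a: r x d_in     down-projection
    lora_b: d_out x r    up-projection
    """
    r = len(lora_a)                 # adapter rank
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # d_out x d_in, rank at most r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# 2x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r = 1, d_in = 3
B = [[0.5], [0.0]]      # d_out = 2, r = 1

W_eff = lora_effective_weight(W, A, B, alpha=2.0)
```

Only the `A` and `B` matrices are trained, which is why the adapter holds roughly 6.2 million parameters while the base model's weights stay untouched.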

## Citation

```bibtex
@misc{rahnemoonfar2020floodnet,
  title={FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding},
  author={Maryam Rahnemoonfar and Tashnim Chowdhury and Argho Sarkar and Debvrat Varshney and Masoud Yari and Robin Murphy},
  year={2020},
  eprint={2012.02951},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2012.02951}
}
```