takara-ai
/

pixtral_aerial_VQA_adapter

Image-Text-to-Text

Inference Endpoints

Model card Files Files and versions Community

pixtral_aerial_VQA_adapter / README.md

takarajordan's picture

Update README.md

dbb98f9 verified 2 months ago

|

history blame contribute delete

1.42 kB

	---
	license: mit
	base_model:
	- mistralai/Pixtral-12B-2409
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- lora
	datasets:
	- Multimodal-Fatima/FGVC_Aircraft_train
	- takara-ai/FloodNet_2021-Track_2_Dataset_HF
	---
	# pixtral_aerial_VQA_adapter

	## Model Details

	- Type: LoRA Adapter
	- Total Parameters: 6,225,920
	- Memory Usage: 23.75 MB
	- Precisions: torch.float32
	- Layer Types:
	- lora_A: 40
	- lora_B: 40

	## Intended Use

	- Primary intended uses: Processing aerial footage of construction sites for structural and construction surveying.
	- Can also be applied to any detailed VQA use cases with aerial footage.

	## Training Data

	- Dataset:
	1. FloodNet Track 2 dataset
	2. Subset of FGVC Aircraft dataset
	3. Custom dataset of 10 image-caption pairs created using Pixtral

	## Training Procedure

	- Training method: LoRA (Low-Rank Adaptation)
	- Base model: Ertugrul/Pixtral-12B-Captioner-Relaxed
	- Training hardware: Nebius-hosted NVIDIA H100 machine

	## Citation

	```bibtext
	@misc{rahnemoonfar2020floodnet,
	title={FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding},
	author={Maryam Rahnemoonfar and Tashnim Chowdhury and Argho Sarkar and Debvrat Varshney and Masoud Yari and Robin Murphy},
	year={2020},
	eprint={2012.02951},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	doi={10.48550/arXiv.2012.02951}
	}
	```