---
base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
- mirzaei2114/stackoverflowVQA-filtered-small
language:
- en
tags:
- llava
- llava-next
- fine-tuned
- stack-overflow
- qlora
- images
- vqa
- 4bit
---

# Model Card for Fine-Tuned LLaVA-Next (Stack Overflow VQA)

Fine-tuned LLaVA-Next model for visual question answering on Stack Overflow questions that include images.

## Model Details

### Model Description

This model is a fine-tuned version of **LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf)**, specialized for visual question answering (VQA) on Stack Overflow questions containing images. It was fine-tuned using **QLoRA** with 4-bit quantization and handles both text and image inputs.

The training data was taken from the **mirzaei2114/stackoverflowVQA-filtered-small** dataset. Only samples with a combined question-and-answer input length of at most 1024 were used. Images were kept at their original size to preserve the detail needed for methods such as optical character recognition.

- **Developed by:** Adam Cassidy
- **Model type:** Visual QA
- **Language(s) (NLP):** EN
- **License:** Apache License, Version 2.0
- **Finetuned from model:** llava-hf/llava-v1.6-mistral-7b-hf

### Model Sources

- **Repository:** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)

## Uses

Take a screenshot (e.g., drag a snipping rectangle) around the exact focus or context of a question related to software development (usually front end), and provide it together with the question text for inference.

### Direct Use

Visual question answering (VQA) on technical Stack Overflow (software-adjacent) questions with accompanying images.

### Out-of-Scope Use

General-purpose VQA tasks; performance on non-technical domains may vary.

## Bias, Risks, and Limitations

- **Model capacity:** The model was trained using 4-bit QLoRA, which constrains capacity compared with full-precision fine-tuning.
- **Dataset size:** The training dataset is relatively small, which may limit generalization to other VQA datasets or to domains outside of Stack Overflow.
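To make the capacity point concrete, 4-bit QLoRA fine-tuning of this base model is typically set up along the lines of the sketch below, using the `transformers`, `bitsandbytes`, and `peft` APIs. This is a minimal illustration: the LoRA rank, alpha, dropout, and target modules shown are common defaults, not the exact values used for this checkpoint.

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization (the "4-bit" part of QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters on top of the frozen 4-bit weights.
# Rank, alpha, and target modules here are illustrative, not the values used for this model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable
```

Only the low-rank adapter weights are updated during training while the 4-bit base weights stay frozen, which keeps memory requirements low but also bounds how far the fine-tuned model can move away from the base model.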
## How to Get Started with the Model

To use this model, ensure you have the following dependencies installed:

- torch==2.4.1+cu121
- transformers==4.45.1

Run inference following the multi-image inference example in the LLaVA-Next documentation:
https://huggingface.co/docs/transformers/en/model_doc/llava_next#:~:text=skip_special_tokens%3DTrue))-,Multi%20image%20inference,-LLaVa%2DNext%20can

## Training Details

### Training Data

[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/train)

### Training Procedure

#### Training Hyperparameters

```python
TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    max_grad_norm=0.1,
    evaluation_strategy="steps",
    eval_steps=15,
    group_by_length=True,
    logging_steps=15,
    gradient_checkpointing=True,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.1,
    warmup_steps=10,
    lr_scheduler_type="cosine",
    learning_rate=1e-5,
    save_steps=15,
    save_total_limit=5,
    bf16=True,
    remove_unused_columns=False
)
```

#### Speeds, Sizes, Times

checkpoint-240

## Evaluation

- Evaluation loss (pre-fine-tuning): 2.93
- Validation loss (post-fine-tuning): 1.78

### Testing Data, Factors & Metrics

#### Testing Data

[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/test)

### Compute Infrastructure

#### Hardware

L4 GPU

#### Software

Google Colab

### Framework versions

- PEFT 0.13.1.dev0
- PyTorch 2.4.1+cu121
- Transformers 4.45.1
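For reference, with the versions pinned above, a minimal single-image inference sketch looks like the following. It loads the base model in 4-bit, attaches this adapter with `peft`, and builds the prompt with the processor's chat template, in the style of the LLaVA-Next documentation linked under "How to Get Started". The adapter repository id, image path, and question below are placeholders to replace with your own.

```python
import torch
from PIL import Image
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration, LlavaNextProcessor
from peft import PeftModel

base_id = "llava-hf/llava-v1.6-mistral-7b-hf"
adapter_id = "your-username/your-adapter-repo"  # placeholder: this fine-tuned adapter's repo id

# Load the processor and the 4-bit quantized base model.
processor = LlavaNextProcessor.from_pretrained(base_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto",
)
# Attach the fine-tuned LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, adapter_id)

# One screenshot plus the Stack Overflow-style question about it.
image = Image.open("screenshot.png")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Why does this CSS grid overflow its container?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```

For questions that reference more than one screenshot, follow the multi-image inference example linked above and pass a list of images, with one `{"type": "image"}` entry per image in the conversation.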