# When Do We Not Need Larger Vision Models?

## Model
This is a LLaVA-v1.5-7B model trained with S2-Wrapper, a simple approach that enables any pre-trained vision model to perceive high-resolution images. This model uses image resolutions of up to 1008×1008.
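For illustration, below is a minimal PyTorch sketch of the multi-scale idea behind S2: the image is encoded at the base resolution and again at a larger scale (split into base-size crops whose features are pooled back to the base token grid), and the per-scale features are concatenated along the channel dimension. The `s2_forward` helper, its `scales`/`base_size` arguments, and the 336-pixel default are assumptions for this sketch, not the official `s2wrapper` API.

```python
import torch
import torch.nn.functional as F

def s2_forward(vision_model, images, scales=(1, 2), base_size=336):
    """Sketch of multi-scale feature extraction (hypothetical helper).

    vision_model: callable mapping (B, 3, base_size, base_size) -> (B, N, C)
                  token features, e.g. a ViT encoder without the CLS token.
    images:       (B, 3, H, W) input batch.
    """
    multi_scale_feats = []
    for s in scales:
        size = base_size * s
        x = F.interpolate(images, size=(size, size),
                          mode="bilinear", align_corners=False)
        if s == 1:
            feats = vision_model(x)                        # (B, N, C)
        else:
            # Split the large image into s*s base-size crops and encode each.
            crops = x.unfold(2, base_size, base_size).unfold(3, base_size, base_size)
            crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, base_size, base_size)
            crop_feats = vision_model(crops)               # (B*s*s, N, C)
            B = images.shape[0]
            N, C = crop_feats.shape[1], crop_feats.shape[2]
            grid = int(N ** 0.5)                           # assumes a square token grid
            # Reassemble crop token grids into one large grid, then pool back
            # to the base grid so all scales share the same number of tokens.
            crop_feats = crop_feats.reshape(B, s, s, grid, grid, C)
            crop_feats = crop_feats.permute(0, 5, 1, 3, 2, 4).reshape(B, C, s * grid, s * grid)
            feats = F.adaptive_avg_pool2d(crop_feats, grid)
            feats = feats.flatten(2).transpose(1, 2)       # (B, N, C)
        multi_scale_feats.append(feats)
    # Channel-wise concatenation of per-scale features.
    return torch.cat(multi_scale_feats, dim=-1)            # (B, N, C * len(scales))
```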
## Training
The training pipeline and dataset follow LLaVA-v1.5 exactly; the model is fine-tuned with LoRA.
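As a rough illustration of what LoRA fine-tuning looks like, here is a minimal sketch using Hugging Face `peft`. The rank, alpha, dropout, and target modules shown are placeholder values for illustration, not necessarily the exact recipe used for this checkpoint.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the language-model backbone (model name assumed for illustration).
base_model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# LoRA hyperparameters are assumptions; see the LLaVA training scripts for the real recipe.
lora_config = LoraConfig(
    r=128,                # LoRA rank (assumed)
    lora_alpha=256,       # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

# Wrap the backbone so that only the low-rank adapter weights are trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```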
## Benchmarking
| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 7B | full_ft-1e | liuhaotian/llava-v1.5-7b | 78.5 | 50.0 | 58.2 | 36.2 | 25.2 | 64.3 | 65.7 | 31.1 |
| LLaVA-1.5 | 7B | lora-1e | liuhaotian/llava-v1.5-7b-lora | 79.1 | 47.8 | 58.2 | - | - | 66.1 | - | 30.2 |
| LLaVA-1.5-S2 | 7B | lora-1e | this model | 80.0 | 50.1 | 61.0 | 37.7 | 25.3 | 66.2 | 67.9 | 32.4 |
## License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.