Zhang199/TinyLLaVA-Qwen2.5-3B-SigLIP

TinyLLaVA

Here, we introduce TinyLLaVA-Qwen2.5-3B-SigLIP, which is trained with the TinyLLaVA Factory codebase. For the LLM and vision tower, we use Qwen2.5-3B and siglip-so400m-patch14-384, respectively.

Usage

Execute the following test code:

from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'Zhang199/TinyLLaVA-Qwen2.5-3B-SigLIP'

# trust_remote_code=True loads the custom model code (including chat()) from the model repository.
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
model.cuda()  # requires a CUDA-capable GPU

# Tokenizer settings are read from the model config.
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side,
)

prompt = "What are these?"
image_url = "http://images.cocodataset.org/test-stuff2017/000000000001.jpg"
output_text, generation_time = model.chat(prompt=prompt, image=image_url, tokenizer=tokenizer)

print('model output:', output_text)
print('running time:', generation_time)

Results

model_name	vqav2	gqa	sqa	textvqa	MM-VET	POPE	MME	MMMU
LLaVA-1.5-7B	78.5	62.0	66.8	58.2	30.5	85.9	1510.7	-
bczhou/TinyLLaVA-3.1B (our legacy model)	79.9	62.0	69.1	59.1	32.0	86.4	1464.9	-
tinyllava/TinyLLaVA-Gemma-SigLIP-2.4B	78.4	61.6	64.4	53.6	26.9	86.4	1339.0	31.7
tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B	80.1	62.1	73.0	60.3	37.5	87.2	1466.4	38.4
Zhang199/TinyLLaVA-Qwen2-0.5B-SigLIP	72.33	55.84	60.14	45.17	19.5	86.59	1153	29.7
Zhang199/TinyLLaVA-Qwen2.5-3B-SigLIP	79.4	62.5	74.1	58.3	34.8	87.4	1438.7	39.9
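
To make the table easier to read at a glance, the following minimal sketch finds the top-scoring model on each benchmark. The scores are copied from the table above, with "-" (not reported) stored as None. The names BENCHMARKS, SCORES, and best_per_benchmark are illustrative only and are not part of TinyLLaVA Factory.

```python
# Scores copied from the results table; "-" (not reported) becomes None.
BENCHMARKS = ["vqav2", "gqa", "sqa", "textvqa", "MM-VET", "POPE", "MME", "MMMU"]

SCORES = {
    "LLaVA-1.5-7B": [78.5, 62.0, 66.8, 58.2, 30.5, 85.9, 1510.7, None],
    "bczhou/TinyLLaVA-3.1B": [79.9, 62.0, 69.1, 59.1, 32.0, 86.4, 1464.9, None],
    "tinyllava/TinyLLaVA-Gemma-SigLIP-2.4B": [78.4, 61.6, 64.4, 53.6, 26.9, 86.4, 1339.0, 31.7],
    "tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B": [80.1, 62.1, 73.0, 60.3, 37.5, 87.2, 1466.4, 38.4],
    "Zhang199/TinyLLaVA-Qwen2-0.5B-SigLIP": [72.33, 55.84, 60.14, 45.17, 19.5, 86.59, 1153, 29.7],
    "Zhang199/TinyLLaVA-Qwen2.5-3B-SigLIP": [79.4, 62.5, 74.1, 58.3, 34.8, 87.4, 1438.7, 39.9],
}


def best_per_benchmark(scores, benchmarks):
    """Return {benchmark: (model_name, score)} for the highest reported score."""
    best = {}
    for i, bench in enumerate(benchmarks):
        # Skip models that did not report this benchmark.
        reported = [(name, row[i]) for name, row in scores.items() if row[i] is not None]
        best[bench] = max(reported, key=lambda pair: pair[1])
    return best


for bench, (name, score) in best_per_benchmark(SCORES, BENCHMARKS).items():
    print(f"{bench:8s} {score:>8} {name}")
```

On these numbers, TinyLLaVA-Qwen2.5-3B-SigLIP has the highest score on gqa, sqa, POPE, and MMMU; TinyLLaVA-Phi-2-SigLIP-3.1B leads on vqav2, textvqa, and MM-VET; and LLaVA-1.5-7B leads on MME.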

P.S. TinyLLaVA Factory is an open-source, modular codebase for small-scale LMMs, with a focus on simple code, easy extension with new features, and reproducible training results. The repository provides standard training and evaluation pipelines, flexible data preprocessing and model configuration, and easily extensible architectures. Users can customize their own LMMs with minimal coding effort and fewer coding mistakes.

TinyLLaVA Factory integrates a suite of cutting-edge models and methods.

Supported LLMs: OpenELM, TinyLlama, StableLM, Qwen, Gemma, Phi, and Qwen2.
Supported vision towers: CLIP, SigLIP, Dino, and a combination of CLIP and Dino.
Supported connectors: MLP, Qformer, and Resampler.
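
The three component slots listed above (LLM, vision tower, connector) can be pictured as a small recipe with one choice per slot. The sketch below is purely illustrative: SUPPORTED and ModelRecipe are hypothetical names and are not part of the TinyLLaVA Factory API, and the "CLIP+Dino" label is an assumed shorthand for the combined vision tower. The connector in the example is chosen only for illustration, since this card does not state which connector the released model uses.

```python
from dataclasses import dataclass

# Component families from the lists above. This layout and the ModelRecipe
# class are hypothetical and are NOT part of the TinyLLaVA Factory API.
SUPPORTED = {
    "llm": {"OpenELM", "TinyLlama", "StableLM", "Qwen", "Gemma", "Phi", "Qwen2"},
    "vision_tower": {"CLIP", "SigLIP", "Dino", "CLIP+Dino"},
    "connector": {"MLP", "Qformer", "Resampler"},
}


@dataclass(frozen=True)
class ModelRecipe:
    """One choice per slot: language model, vision tower, and connector."""

    llm: str
    vision_tower: str
    connector: str

    def __post_init__(self):
        # Reject any component family that is not in the supported lists.
        for slot, allowed in SUPPORTED.items():
            value = getattr(self, slot)
            if value not in allowed:
                raise ValueError(
                    f"unsupported {slot} {value!r}; expected one of {sorted(allowed)}"
                )


# Illustrative recipe; the connector choice here is an assumption.
recipe = ModelRecipe(llm="Qwen2", vision_tower="SigLIP", connector="MLP")
print(recipe)
```

Validating the recipe at construction time means a misspelled or unsupported component name fails immediately with a ValueError instead of surfacing later during training.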