license: apache-2.0
language:
  - en
base_model:
  - Qwen/Qwen2.5-0.5B-Instruct
  - facebook/dinov2-small
pipeline_tag: visual-question-answering

Pretrain stage only, 4630 epochs

Introduction

We use TinyLLaVA Factory to build a very small image-text-to-text model.

The goal is to make it possible to run LLaVA models on edge devices (with a few gigabytes of memory).

For the LLM and vision tower, we chose Qwen/Qwen2.5-0.5B-Instruct and facebook/dinov2-small, respectively.

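POPE

| Category | # Samples | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 Score | Yes Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| Adversarial | 3000 | 1312 | 1250 | 250 | 188 | 0.521 | 0.512 | 0.875 | 0.646 | 0.854 |
| Popular | 3000 | 1312 | 1236 | 264 | 188 | 0.525 | 0.515 | 0.875 | 0.648 | 0.849 |
| Random | 2910 | 1312 | 1185 | 225 | 188 | 0.528 | 0.525 | 0.875 | 0.656 | 0.858 |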
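
For readers who want to check the POPE numbers, the derived columns can be recomputed from the raw TP/FP/TN/FN counts. The sketch below assumes the standard binary-classification definitions, with "Yes Ratio" taken to mean the fraction of samples answered "yes"; the function name `pope_metrics` is hypothetical and is not part of TinyLLaVA Factory or the POPE tooling.

```python
def pope_metrics(tp, fp, tn, fn):
    """Recompute the derived POPE columns from raw confusion counts.

    Assumes standard definitions; "yes_ratio" is taken to be the share
    of samples for which the model answered "yes" (TP + FP).
    """
    total = tp + fp + tn + fn
    predicted_yes = tp + fp
    precision = tp / predicted_yes
    recall = tp / (tp + fn)
    return {
        "samples": total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "yes_ratio": predicted_yes / total,
    }


# Counts copied from the POPE results above: (TP, FP, TN, FN).
rows = {
    "Adversarial": (1312, 1250, 250, 188),
    "Popular": (1312, 1236, 264, 188),
    "Random": (1312, 1185, 225, 188),
}

for name, counts in rows.items():
    metrics = pope_metrics(*counts)
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under these definitions, an accuracy close to 0.5 combined with a yes ratio around 0.85 suggests that the model answers "yes" to most questions regardless of the image, which is plausible for a checkpoint that has only gone through the pretrain stage.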

TEXTVQA

Samples 5000, Accuracy 0% (:-|)

SCIENCEQA

Samples 4241, Correct: -, Accuracy: -%, IMG-Accuracy: -%

MMMU

| Category | # Samples | Accuracy |
|---|---|---|
| Overall | 900 | 0.280 |
| Overall-Art and Design | 120 | 0.208 |
| Art | 30 | 0.167 |
| Art Theory | 30 | 0.200 |
| Design | 30 | 0.367 |
| Music | 30 | 0.100 |
| Overall-Business | 150 | 0.213 |
| Accounting | 30 | 0.100 |
| Economics | 30 | 0.367 |
| Finance | 30 | 0.200 |
| Management | 30 | 0.233 |
| Marketing | 30 | 0.167 |
| Overall-Science | 150 | 0.300 |
| Biology | 30 | 0.300 |
| Chemistry | 30 | 0.133 |
| Geography | 30 | 0.300 |
| Math | 30 | 0.333 |
| Physics | 30 | 0.433 |
| Overall-Health and Medicine | 150 | 0.340 |
| Basic Medical Science | 30 | 0.300 |
| Clinical Medicine | 30 | 0.133 |
| Diagnostics and Laboratory Med. | 30 | 0.333 |
| Pharmacy | 30 | 0.400 |
| Public Health | 30 | 0.533 |
| Overall-Humanities and Soc. Sci. | 120 | 0.342 |
| History | 30 | 0.300 |
| Literature | 30 | 0.567 |
| Sociology | 30 | 0.233 |
| Psychology | 30 | 0.267 |
| Overall-Tech and Engineering | 210 | 0.276 |
| Agriculture | 30 | 0.300 |
| Architecture and Engineering | 30 | 0.200 |
| Computer Science | 30 | 0.367 |
| Electronics | 30 | 0.200 |
| Energy and Power | 30 | 0.367 |
| Materials | 30 | 0.233 |
| Mechanical Engineering | 30 | 0.267 |
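
As a consistency check on the MMMU results, the overall accuracy can be reproduced as a sample-weighted mean of the six category-level accuracies. This is a minimal sketch that assumes the overall score is aggregated this way; the helper `weighted_accuracy` is hypothetical, and the inputs are the rounded values reported above, so the result only matches to about three decimals.

```python
def weighted_accuracy(groups):
    """Return (total samples, sample-weighted mean accuracy).

    `groups` maps a category name to (number of samples, accuracy).
    """
    total = sum(n for n, _ in groups.values())
    weighted = sum(n * acc for n, acc in groups.values()) / total
    return total, weighted


# Category-level rows copied from the MMMU results above.
mmmu_groups = {
    "Art and Design": (120, 0.208),
    "Business": (150, 0.213),
    "Science": (150, 0.300),
    "Health and Medicine": (150, 0.340),
    "Humanities and Soc. Sci.": (120, 0.342),
    "Tech and Engineering": (210, 0.276),
}

total, overall = weighted_accuracy(mmmu_groups)
print(total, round(overall, 3))
```

Because every subject has 30 samples, each category score is also the plain average of its subject scores, so the same helper can be applied one level down.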