---
language:
- en
tags:
- llava
- multimodal
- qwen
license: apache-2.0
pipeline_tag: image-text-to-text
---

# nanoLLaVA-1.5 - Improved sub 1B Vision-Language Model

<p align="center">
<img src="https://i.postimg.cc/d15k3YNG/nanollava.webp" alt="Logo" width="350">
</p>

## Description

nanoLLaVA-1.5 is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices. It is an update to v1.0, [qnguyen3/nanoLLaVA](https://huggingface.co/qnguyen3/nanoLLaVA).

- **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
- **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

| Model | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA** | **MM-VET** |
|---------|--------|---------|-----------|------|-------------|-------------|------|--------|
| nanoLLaVA-1.0 | 70.84 | 46.71 | 58.97 | 84.1 | 28.6 | 30.4 | 54.79 | 23.9 |
| nanoLLaVA-1.5 | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD |

## Training Data

The training data will be released later, as I am still writing a paper on this. Expect the final version to be much more powerful than the current one.

## Finetuning Code

Coming Soon!!!

## Usage

You can use the model with `transformers` via the following script:

```bash
pip install -U transformers accelerate flash_attn
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'qnguyen3/nanoLLaVA-1.5'

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

# tokenize the text on either side of '<image>' and put -200 in its place
# as the placeholder id for the image features
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# image, sample images can be found in images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

# decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
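The `input_ids` step in the script splits the templated prompt at `<image>` and puts `-200` where the marker was. As a minimal sketch of that step in isolation, the helper below does the same splicing for any number of markers. `splice_image_tokens` and `toy_encode` are hypothetical names used only for illustration and are not part of the model's remote code. Whether the model accepts more than one image per prompt is not stated here, so treat the multi-marker case as an assumption.

```python
from typing import Callable, List

IMAGE_TOKEN_INDEX = -200  # placeholder id used in the usage script


def splice_image_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    image_marker: str = "<image>",
    image_token_index: int = IMAGE_TOKEN_INDEX,
) -> List[int]:
    """Tokenize the text around each marker and join the chunks with the placeholder id."""
    chunks = [encode(chunk) for chunk in text.split(image_marker)]
    ids: List[int] = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # one placeholder between every pair of adjacent chunks
            ids.append(image_token_index)
        ids.extend(chunk)
    return ids


def toy_encode(s: str) -> List[int]:
    """Stand-in encoder (hypothetical): maps each character to its code point."""
    return [ord(c) for c in s]


ids = splice_image_tokens("ab<image>\ncd", toy_encode)
print(ids)  # [97, 98, -200, 10, 99, 100]
```

With the real tokenizer, `encode` would be `lambda s: tokenizer(s).input_ids`, and the resulting list can be wrapped with `torch.tensor(ids, dtype=torch.long).unsqueeze(0)` as in the script.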

## Prompt Format

The model follows the ChatML standard, but without a `\n` after `<|im_end|>`:

```
<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
```
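To make the layout concrete, here is a minimal sketch that renders a message list in this format by hand. `build_chatml_prompt` is a hypothetical helper for illustration. In practice `tokenizer.apply_chat_template` builds the prompt from the model's own template, as in the usage script. The newline after the final `assistant` header is an assumption, since the example above does not show what follows it.

```python
from typing import Dict, List


def build_chatml_prompt(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = True,
) -> str:
    """Render messages as ChatML with no newline after <|im_end|>."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    prompt = "".join(parts)  # turns are concatenated directly, with no separator
    if add_generation_prompt:
        # Assumption: the assistant header is followed by a newline.
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "Answer the question"},
    {"role": "user", "content": "<image>\nWhat is the picture about?"},
]
print(build_chatml_prompt(messages))
```

Comparing this output with the `print(text)` line in the usage script is a quick way to check that the assumed template matches the one shipped with the tokenizer.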

<!-- ---
| Image | Example |
|--------------------------------------|---------------------------------------------------------------------------------------------|
| ![small](example_1.png) | **What is the text saying?** <br> "Small but mighty". <br>**How does the text correlate to the context of the image?** <br> The text seems to be a playful or humorous representation of a small but mighty figure, possibly a mouse or a mouse toy, holding a weightlifting bar. |
--- -->

The model is trained using a modified version of [Bunny](https://github.com/BAAI-DCAI/Bunny/tree/main/bunny).