---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
- FreedomIntelligence/ALLaVA-4V-Chinese
- shareAI/ShareGPT-Chinese-English-90k
language:
- zh
- en
pipeline_tag: visual-question-answering
---
<br>
<br>

# Model Card for 360VL 
<p align="center">
  <img src="https://github.com/360CVGroup/360VL/blob/master/qh360_vl/360vl.PNG?raw=true" width=100%/>
</p>

**360VL** is built on the Llama 3 language model and is the industry's first open-source large multimodal model based on **Llama3-70B** [[🤗Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)]. Beyond adopting Llama 3 as its language model, 360VL introduces a globally-aware multi-branch projector architecture that gives the model stronger image understanding capabilities.

**GitHub**: https://github.com/360CVGroup/360VL
## Model Zoo

360VL has released the following versions.

Model |       Download
|---|---
360VL-8B |  [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-8B) 
360VL-70B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-70B)  
## Features

360VL offers the following features:

- Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.
  
- Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.  
  
- Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.
  
- Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672&times;672.
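Fitting an arbitrary photo into the model's fixed 672&times;672 input typically involves an aspect-ratio-preserving resize. The bundled image processor handles this automatically; the sketch below (with the hypothetical helper `fit_to_resolution`) only illustrates the scaling arithmetic, not 360VL's exact preprocessing.

```python
def fit_to_resolution(w, h, size=672):
    """Compute dimensions that fit a (w, h) image inside a size x size
    square while preserving aspect ratio (the remainder would be padded)."""
    scale = size / max(w, h)
    return round(w * scale), round(h * scale)

# A 1344x672 panorama is halved to fit the 672x672 input square.
print(fit_to_resolution(1344, 672))  # (672, 336)
```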

## Performance
| Model               | Checkpoints   | MMB<sub>T</sub>  | MMB<sub>D</sub> | MMB-CN<sub>T</sub>  | MMB-CN<sub>D</sub> | MMMU<sub>V</sub> | MMMU<sub>T</sub> | MME |
|:--------------------|:------------:|:----:|:------:|:------:|:-------:|:-------:|:-------:|:-------:|
| QWen-VL-Chat |  [🤗LINK](https://huggingface.co/Qwen/Qwen-VL-Chat) | 61.8 | 60.6 |  56.3  |  56.7  |37| 32.9  | 1860 |
| mPLUG-Owl2 |  [🤖LINK](https://www.modelscope.cn/models/iic/mPLUG-Owl2/summary) | 66.0 | 66.5 |  60.3  |  59.5  |34.7| 32.1  | 1786.4 |
| CogVLM |  [🤗LINK](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf) | 65.8| 63.7 | 55.9  | 53.8    |37.3| 30.1 | 1736.6|
| Monkey-Chat |  [🤗LINK](https://huggingface.co/echo840/Monkey-Chat) | 72.4| 71 | 67.5  | 65.8    |40.7| - | 1887.4|
| MM1-7B-Chat |  [LINK](https://ar5iv.labs.arxiv.org/html/2403.09611) | -| 72.3 | -  | -    |37.0| 35.6 |  1858.2|
| IDEFICS2-8B |  [🤗LINK](https://huggingface.co/HuggingFaceM4/idefics2-8b) | 75.7 | 75.3 | 68.6  | 67.3    |43.0| 37.7 |1847.6| 
| SVIT-v1.5-13B|  [🤗LINK](https://huggingface.co/Isaachhe/svit-v1.5-13b-full) | 69.1 | - | 63.1  |  -  | 38.0| 33.3|1889| 
| LLaVA-v1.5-13B |  [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.5-13b) | 69.2 | 69.2 | 65  | 63.6    |36.4| 33.6 | 1826.7|  
| LLaVA-v1.6-13B |  [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) | 70 | 70.7 | 68.5  | 64.3    |36.2| - |1901|
| Honeybee |  [LINK](https://github.com/kakaobrain/honeybee) | 73.6 | 74.3 | -  | -    |36.2|  -|1976.5|
| YI-VL-34B |  [🤗LINK](https://huggingface.co/01-ai/Yi-VL-34B) | 72.4 | 71.1 |  70.7 |   71.4  |45.1| 41.6 |2050.2|
| **360VL-8B** |  [🤗LINK](https://huggingface.co/qihoo360/360VL-8B) | 75.3 | 73.7 | 71.1   | 68.6    |39.7| 37.1 |  1944.6|
| **360VL-70B** |  [🤗LINK](https://huggingface.co/qihoo360/360VL-70B) | 78.1 | 80.4 | 76.9   | 77.7    |50.8| 44.3 |  2012.3|
## Quick Start 🤗

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/360VL-8B"

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token


image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
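The decode step above slices `output_ids` at `input_token_len` because `generate` returns the prompt tokens concatenated with the newly generated ones. A toy illustration of that slicing with plain lists (the token ids are made-up stand-ins):

```python
# generate() output = prompt tokens + new tokens; dropping the first
# input_token_len entries leaves only the model's answer.
prompt_ids = [101, 2054, 2003]        # stand-in prompt token ids
generated = prompt_ids + [2023, 102]  # stand-in generate() output
input_token_len = len(prompt_ids)
answer_ids = generated[input_token_len:]
print(answer_ids)  # [2023, 102]
```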

**Model type:**
360VL-8B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data.
It is an auto-regressive language model, based on the transformer architecture.
Base LLM: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

**Model date:**
360VL-8B was trained in April 2024.



## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

**Where to send questions or comments about the model:**
https://github.com/360CVGroup/360VL

## Related Projects
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
- [Meta Llama 3](https://github.com/meta-llama/llama3)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
- [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://github.com/kakaobrain/honeybee)