File size: 3,280 Bytes
64206ad
d932c54
 
 
64206ad
 
d932c54
64206ad
d932c54
64206ad
d932c54
64206ad
d932c54
64206ad
e69afc2
 
 
 
 
 
d932c54
64206ad
d932c54
64206ad
d932c54
 
64206ad
d932c54
64206ad
e69afc2
 
f496aeb
e69afc2
 
 
 
 
 
 
 
51405cf
 
e69afc2
ef4abcf
e69afc2
 
 
 
 
f58523c
e69afc2
 
 
 
 
64206ad
e69afc2
 
d932c54
64206ad
d932c54
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
---
tags:
- vision
- image-text-to-text
---

# LLaVa-Next, leveraging [NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as LLM

The LLaVA-NeXT model was proposed in [LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/) by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee. LLaVa-NeXT (also called LLaVa-1.6) improves upon [LLaVa](llava) by increasing the input image resolution and training on an improved visual instruction tuning dataset to improve OCR and common sense reasoning.

Disclaimer: The team releasing LLaVa-NeXT did not write a model card for this model so this model card has been written by the Hugging Face team.

## Model description

LLaVa combines a pre-trained large language model with a pre-trained vision encoder for multimodal chatbot use cases. LLaVA 1.6 improves on LLaVA 1.5 BY:
- Using [Mistral-7B](https://mistral.ai/news/announcing-mistral-7b/) and [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)  (for this checkpoint) which has better commercial licenses,
  and bilingual support
- More diverse and high quality data mixture
- Dynamic high resolution
  
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62441d1d9fdefb55a0b7d12c/FPshq08TKYD0e-qwPLDVO.png)

## Intended uses & limitations

You can use the raw model for tasks like image captioning, visual question answering, multimodal chatbot use cases. See the [model hub](https://huggingface.co/models?search=llava-hf) to look for
other versions on a task that interests you.

### How to use

Here's the prompt template for this model:
```
"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\n<your_text_prompt_here><|im_end|><|im_start|>assistant\n"
```
You can load and use the model like following:
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# TODO support fast tokenizer here
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-34b-hf", use_fast=False)

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-34b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True) 
model.to("cuda:0")

# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
```
### BibTeX entry and citation info

```bibtex
@misc{liu2023improved,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Haotian Liu and Chunyuan Li and Yuheng Li and Yong Jae Lee},
      year={2023},
      eprint={2310.03744},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```