---
language:
- ja
license: llama3
tags:
- multimodal
- vision-language
- mantis
- llava
- llama3
- siglip
pipeline_tag: image-to-text
---

# 🐟 Llama-3-EvoVLM-JP-v2

🤗 [Models](https://huggingface.co/SakanaAI) | 📚 [Paper](https://arxiv.org/abs/2403.13187) | 📝 [Blog](https://sakana.ai/evovlm-jp/) | 🐦 [Twitter](https://twitter.com/SakanaAILabs)

**Llama-3-EvoVLM-JP-v2** is an experimental general-purpose Japanese vision-language model (VLM) that accepts **interleaved text and images as input**.
It was created with the Evolutionary Model Merge method.
Please refer to our [report](https://arxiv.org/abs/2403.13187) and [blog](https://sakana.ai/evovlm-jp/) for more details.
The model was produced by merging the following source models, and we are grateful to their developers.

- [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Mantis-8B-siglip-llama3](https://huggingface.co/TIGER-Lab/Mantis-8B-siglip-llama3)
- [Bunny-v1.1-Llama-3-8B-V](https://huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V)
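
To give a rough feel for what searching over merge settings means, here is a toy sketch. It is not the recipe used to build this model: plain linear interpolation and a simple hill-climbing evolution strategy stand in for the much richer method described in the report, and the three-element "models", the `target` vector, and the function names `merge`, `fitness`, and `evolve_alpha` are all made up for illustration.

```python
import random


def merge(weights_a, weights_b, alpha):
    """Element-wise linear interpolation: alpha * a + (1 - alpha) * b."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(weights_a, weights_b)]


def fitness(weights, target):
    """Toy score: negative squared distance to a target vector (higher is better)."""
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def evolve_alpha(weights_a, weights_b, target,
                 generations=50, children=8, sigma=0.1, seed=0):
    """Hill-climbing evolution strategy over a single mixing coefficient."""
    rng = random.Random(seed)
    best_alpha = 0.5
    best_score = fitness(merge(weights_a, weights_b, best_alpha), target)
    for _ in range(generations):
        for _ in range(children):
            # Mutate the current best coefficient and clamp it to [0, 1].
            cand = min(1.0, max(0.0, best_alpha + rng.gauss(0.0, sigma)))
            score = fitness(merge(weights_a, weights_b, cand), target)
            if score > best_score:
                best_alpha, best_score = cand, score
    return best_alpha, best_score


# Hypothetical stand-ins for two source models' parameters.
model_a = [1.0, 0.0, 2.0]
model_b = [0.0, 1.0, 0.0]
# Pretend the ideal blend is 80% model A; a real search would score a benchmark instead.
target = merge(model_a, model_b, 0.8)

alpha, score = evolve_alpha(model_a, model_b, target)
print(f"alpha={alpha:.3f} score={score:.6f}")
```

In a real setting the fitness would be a task score measured on held-out data, and the search space would cover many more parameters than a single coefficient.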

## Usage

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

First, install the packages required for inference with Mantis. See the [installation instructions](https://huggingface.co/TIGER-Lab/Mantis-8B-siglip-llama3#installation).

```bash
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
```

```python
import requests
from PIL import Image

import torch
from mantis.models.conversation import Conversation, SeparatorStyle
from mantis.models.mllava import chat_mllava, LlavaForConditionalGeneration, MLlavaProcessor
from mantis.models.mllava.utils import conv_templates

# 1. Set the system prompt
# (Japanese for: "You are a sincere and excellent Japanese assistant.
#  Unless otherwise instructed, always answer in Japanese.")
conv_llama_3_elyza = Conversation(
    system="<|start_header_id|>system<|end_header_id|>\n\nあなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。",
    roles=("user", "assistant"),
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3,
    sep="<|eot_id|>",
)
# Override the default Llama 3 template so chat_mllava uses the Japanese system prompt
conv_templates["llama_3"] = conv_llama_3_elyza

# 2. Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "SakanaAI/Llama-3-EvoVLM-JP-v2"

# The processor is loaded from the Mantis source model
processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-8B-siglip-llama3")
processor.tokenizer.pad_token = processor.tokenizer.eos_token

model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map=device).eval()

# 3. Prepare a generate config (greedy decoding)
generation_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "do_sample": False,
    "no_repeat_ngram_size": 3,
}

# 4. Generate
# Prompt: "What color is the traffic light in <image>?"
text = "<image>の信号は何色ですか?"
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url_list[0], stream=True).raw).convert("RGB")
]

response, history = chat_mllava(text, images, model, processor, **generation_kwargs)

print(response)
# 信号の色は、青色です。
# ("The color of the traffic light is blue." In Japanese, a green light is called "blue".)

# 5. Multi-turn conversation
# Prompt: "Then, what about the traffic light in <image>?"
text = "では、<image>の信号は?"
# Append the new image; earlier images stay in the list because the history refers to them
images += [
    Image.open(requests.get(url_list[1], stream=True).raw).convert("RGB")
]
response, history = chat_mllava(text, images, model, processor, history=history, **generation_kwargs)

print(response)
# 赤色
# ("Red")
```
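
The example above pairs each `<image>` placeholder in the text with one entry of `images`, in order. When a single turn mixes several images with text, it can be convenient to build both pieces from one sequence. The helper below is a minimal sketch under that assumption; `build_interleaved_prompt` and `IMAGE_TOKEN` are hypothetical names, not part of Mantis, and plain `object()` values stand in for `PIL.Image` instances so that the snippet runs without downloading anything.

```python
from typing import Any, List, Sequence, Tuple

IMAGE_TOKEN = "<image>"  # placeholder string used in the usage example above


def build_interleaved_prompt(segments: Sequence[Any]) -> Tuple[str, List[Any]]:
    """Split a mixed sequence of strings and image objects into (text, images).

    Strings are copied into the prompt as-is; every other item is treated as
    an image and replaced by one IMAGE_TOKEN placeholder.
    """
    parts: List[str] = []
    images: List[Any] = []
    for seg in segments:
        if isinstance(seg, str):
            parts.append(seg)
        else:
            parts.append(IMAGE_TOKEN)
            images.append(seg)
    text = "".join(parts)
    # Guard against a literal "<image>" typed inside a text segment,
    # which would leave placeholders and images out of step.
    if text.count(IMAGE_TOKEN) != len(images):
        raise ValueError(
            f"{text.count(IMAGE_TOKEN)} placeholders but {len(images)} images"
        )
    return text, images


# Stand-ins for two loaded images (hypothetical; use PIL images in practice).
img_a, img_b = object(), object()

# Prompt: "What is the difference between <image> and <image>?"
text, images = build_interleaved_prompt([img_a, "と", img_b, "の違いは何ですか?"])
print(text)         # <image>と<image>の違いは何ですか?
print(len(images))  # 2
```

The resulting `text` and `images` can then be passed to `chat_mllava` in the same way as in the example above.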

</details>

## Model Details

- **Developed by:** [Sakana AI](https://sakana.ai/)
- **Model type:** Autoregressive Language Model
- **Language(s):** Japanese
- **Optimization data:** subsets of the [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa) and the translated [ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V)
- **License:** [META LLAMA 3 COMMUNITY LICENSE](https://llama.meta.com/llama3/license/)
- **Paper:** https://arxiv.org/abs/2403.13187
- **Blog:** https://sakana.ai/evovlm-jp/

## Uses

This model is provided for research and development purposes only and should be considered an experimental prototype.
It is not intended for commercial use or deployment in mission-critical environments.
Use of this model is at the user's own risk, and its performance and outcomes are not guaranteed.
Sakana AI shall not be liable for any direct, indirect, special, incidental, or consequential damages, or any loss arising from the use of this model, regardless of the results obtained.
Users must fully understand the risks associated with the use of this model and use it at their own discretion.

## Acknowledgement

We would like to thank the developers of the source models for their contributions and for making their work available.

## Citation

```bibtex
@misc{Llama-3-EvoVLM-JP-v2,
  url    = {https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2},
  title  = {Llama-3-EvoVLM-JP-v2},
  author = {Inoue, Yuichi and Akiba, Takuya and Shing, Makoto}
}
```

```bibtex
@misc{akiba2024evomodelmerge,
  title         = {Evolutionary Optimization of Model Merging Recipes},
  author        = {Takuya Akiba and Makoto Shing and Yujin Tang and Qi Sun and David Ha},
  year          = {2024},
  eprint        = {2403.13187},
  archivePrefix = {arXiv},
  primaryClass  = {cs.NE}
}
```