---
library_name: transformers
pipeline_tag: image-text-to-text
datasets: Vikhrmodels/LLaVA-Instruct-ru
language:
- ru
license: apache-2.0
tags:
- multimodal
- vision
- image-text-to-text
---
# Model Card for ruIdefics2-ruLLaVA-merged
A Russian-language version of Idefics2, fine-tuned on a Russian-translated subset of LLaVA.
SFT used no text-only data, so quality may well drop on text-only inputs.
Training was done in int4 with QLoRA on consumer-grade hardware.
## Model Details
### Model Description
- **Model type:** ruIdefics2
- **Language(s) (NLP):** Russian
- **License:** Apache-2.0
- **Finetuned from model:** Idefics2
# How to Get Started
## Running in fp16
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Sample images; only image1 is passed to the model below.
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

processor = AutoProcessor.from_pretrained("GeorgeBredis/ruIdefics2-ruLLaVA-merged")
model = AutoModelForVision2Seq.from_pretrained(
    "GeorgeBredis/ruIdefics2-ruLLaVA-merged",
    torch_dtype=torch.float16,  # load the weights in fp16 rather than the fp32 default
).to(DEVICE)

# One user turn: an image placeholder followed by the question
# ("What is shown in this picture?").
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Что изображено на данной картинке?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```
The fp16 model may well not fit on your GPU (if you load it onto a GPU), so below is a bitsandbytes (bnb) variant that can run in Colab.
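To gauge whether a given precision is likely to fit on a particular card, a back-of-the-envelope estimate of weight memory can help. The sketch below is only illustrative. The parameter count of roughly 8 billion is an assumption based on the public Idefics2-8B checkpoint, and `weight_memory_gib` is a hypothetical helper, not part of any library. It counts weights only. Activations, the KV cache, and quantization overhead come on top, and bitsandbytes typically leaves some layers unquantized.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Return the approximate memory taken by the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3


# Assumption: ~8e9 parameters, as in the public Idefics2-8B checkpoint.
N_PARAMS = 8e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gib(N_PARAMS, precision):.1f} GiB")
```

Under that assumption, the fp16 weights alone come to roughly 15 GiB, which is about the full memory of a 16 GB card. This is consistent with the suggestion to use int4 or int8 on smaller GPUs.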
## Running in int4/int8 with bnb
Requires `peft` to be installed (in addition to `bitsandbytes`).
```python
import torch
# peft must be installed; LoraConfig is not used directly in this snippet.
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Sample images; only image1 is passed to the model below.
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

processor = AutoProcessor.from_pretrained(
    "GeorgeBredis/ruIdefics2-ruLLaVA-merged",
    do_image_splitting=False,
)

# 4-bit NF4 quantization with double quantization; compute runs in fp16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = Idefics2ForConditionalGeneration.from_pretrained(
    "GeorgeBredis/ruIdefics2-ruLLaVA-merged",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
# No .to(DEVICE) is needed: int4/int8 models are placed on the GPU at load time.

# One user turn: an image placeholder followed by the question
# ("What is shown in this picture?").
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Что изображено на данной картинке?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```
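Both snippets load three images but pass only `image1` to the model. If several images are to go into one prompt, the chat message needs one `{"type": "image"}` placeholder per image, in the same order as the `images` list given to the processor. A minimal sketch of building such a message follows. `build_messages` is a hypothetical helper introduced here for illustration, and this card does not say how well the fine-tuned model handles multi-image prompts.

```python
from typing import Any


def build_messages(n_images: int, question: str) -> list[dict[str, Any]]:
    """Build one user turn with n_images image placeholders followed by the text."""
    if n_images < 1:
        raise ValueError("expected at least one image")
    content: list[dict[str, str]] = [{"type": "image"} for _ in range(n_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# "How do these two pictures differ?"
messages = build_messages(2, "Чем отличаются эти две картинки?")
print(messages)

# With the processor and images from the snippets above, the calls would then be:
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
```

The number of placeholders must match the number of images passed to the processor, which is why the helper takes the count explicitly.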