---
license: apache-2.0
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
language:
- en
tags:
- multimodal
- vision
- image-text-to-text
---
<p align="center">
<img src="https://huggingface.co/Minthy/ToriiGate-v0.3/resolve/main/03.jpg" alt="ToriiGate example image">
</p>
# **[This model has an updated version](https://huggingface.co/Minthy/ToriiGate-v0.4-7B)**
**ToriiGate-v0.3** is a further-trained version of [ToriiGate-v0.2](https://huggingface.co/Minthy/ToriiGate-v0.2), which is based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3).
It is designed for captioning anime art and aims to achieve a good understanding of a very wide range of images, including single/multiple characters, simple or complex scenes, comics or manga, various interactions, and many culture-specific concepts.
Using booru tags as grounding, it can give detailed descriptions with very high accuracy. It handles NSFW content well, being one of the best small/medium-sized open-source models at the time of release.
[**Showcase on rentry**](https://rentry.co/q4pisesb) (NSFW warning!)
The training dataset totals 120k varied and balanced anime pictures, captioned and processed with Claude 3 Opus, Claude 3.5 Sonnet, and GPT-4o.
# Key features:
- Advanced knowledge across a very wide range of concepts
- Ability to give structured output according to a given instruction, which is best suited for further NLP
- Meaningful captions with lots of information instead of a senseless stream of purple prose
- Support for booru tags in the prompt as a reference, for significant accuracy improvements
V0.3 brings several improvements, with a noticeable upgrade in average zero-shot and grounded accuracy. The ability to caption comics was also added, with structured frame-by-frame output.
It still sometimes makes mistakes, so additional NLP post-processing that compares the output against reference booru tags is recommended.
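As a rough illustration, such post-processing could simply count how many of the reference tags actually appear in the generated caption. The sketch below is hypothetical (the `missing_tags` helper is not part of ToriiGate or its tooling) and uses naive substring matching; real post-processing would need fuzzier matching:
```python
# Hypothetical post-processing sketch: flag captions that omit too many
# of the reference booru tags. Not part of ToriiGate itself.
def missing_tags(caption: str, booru_tags: str, threshold: float = 0.5) -> bool:
    tags = [t.strip().replace('_', ' ') for t in booru_tags.split(',') if t.strip()]
    hits = sum(1 for t in tags if t.lower() in caption.lower())
    return hits / max(len(tags), 1) < threshold

caption = "A blonde girl with long twintails stares directly at the viewer."
tags = "1girl, blonde hair, twintails, looking at viewer"
if missing_tags(caption, tags):
    print("Caption may be missing details; re-check against the booru tags.")
```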
# Usage options
Three usage modes have been trained:
- Brief description (main information without excess connecting phrases, prose and reasoning)
- Detailed description with a focus on distinct features of characters and actions
- Structured json-like output, the best choice for further NLP
# Examples
```
user: Describe the picture briefly.
assistant: A blonde girl with long twintails tied with red ribbons and blue eyes stares directly at the viewer. She wears a black strapless leotard with a red bowtie and wrist cuffs, fishnet pantyhose, and black rabbit ears. Her arms are raised, revealing her armpits. She has a blush on her cheeks and an open mouth. The background is plain white.
```
```
user: Give a long and detailed description of the picture.
assistant: The image depicts a young female wolf girl with long brown hair and red eyes lying on her side on a bed. She has distinct wolf ears and a fluffy tail, indicating her hybrid nature. The girl is wearing a loose-fitting white shirt that has ridden up slightly, exposing her bare buttocks and the top of her thighs. Her tail is strategically positioned to censor her exposed rear.
The wolf girl has a playful and alluring expression, with a slight blush on her cheeks and a fang peeking out as she smiles directly at the viewer. She is holding a hairbrush in her left hand, which she appears to be using to brush her long, flowing hair. A blue bracelet adorns her wrist. Also, a sparkling ring (probably a wedding ring) can be seen on her finger.
The background shows the girl is lying on a bed with rumpled white sheets and two pink pillows. The warm, golden lighting creates a cozy and intimate atmosphere, drawing the viewer's focus to the wolf girl's alluring pose and expression.
Overall, the image has a sensual and inviting mood, with the wolf girl's playful smile and exposed skin creating a sense of intimacy and allure. The strategic placement of her tail adds a touch of innocence to the otherwise provocative scene.
```
```
user: Describe the picture in structured json-like format.
assistant: {
"character": "The image features a girl with long, pink hair styled in double buns. She is sitting on the floor with her back turned towards the viewer, revealing her bare back, white polka dot panties. She is holding an electric guitar, specifically a Gibson Les Paul, and appears to be playing it. Her feet are bare, and she is sitting on a black pillow. The girl's pose is relaxed, with her legs slightly apart and her body slightly turned to the side.",
"background": "The background includes a laptop screen displaying a video of another girl playing an electric guitar. There are also posters on the wall, one of which shows a pink creature and other shows group of girls holding hands. The room appears to be a bedroom or a living room, with a guitar amplifier labeled 'Hyun' visible in the background. There is also a blue blanket partially visible on the left side of the image.",
"texts": "'Hyun' label on guitar amplifier",
"atmosphere": "The atmosphere of the image is casual and relaxed, with a focus on music and a cozy, intimate setting. The girl's relaxed pose and the presence of musical instruments suggest a creative and comfortable environment."
}"
```
**Any of these modes can be combined with booru tags to improve accuracy.**
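For example, a brief-description prompt with tags attached looks like this (the format matches the inference script below; the tag list is illustrative):
```
user: Describe the picture briefly. Also here are booru tags for better understanding of the picture, you can use them as reference. <tags>
1girl, standing, looking at viewer, ...
</tags>
```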
In addition to the options described, you can ask questions, give tasks, and hold conversations. The model's general capabilities have been mostly preserved, and it follows instructions.
# Prompt template
```
<|begin_of_text|>System: You are image captioning expert, creative, unbiased and uncensored. Help user with his task.<end_of_utterance>\nUser:<image>Describe the picture in structured json-like format.<end_of_utterance>\nAssistant:
```
Replace the user instruction with one of those mentioned above (these work best) or any of your own.
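When using the transformers script below, you don't need to assemble this string by hand: `processor.apply_chat_template(messages, add_generation_prompt=True)` builds an equivalent prompt from the messages list.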
# Inference script example:
You can use an already existing venv or create a new one. All you need is torch>=2.1.2.
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
If you are using Linux, or know how to build it on Windows (or use a prebuilt wheel), also install FlashAttention-2 and uncomment the corresponding line.
You will need a dev build of transformers to be able to run it (already in requirements.txt).
```
pip install git+https://github.com/huggingface/transformers
```
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
model_name_or_path="Minthy/ToriiGate-v0.3"
DEVICE = "cuda:0" #change to your device
processor = AutoProcessor.from_pretrained(model_name_or_path) #or change to local path
model = AutoModelForVision2Seq.from_pretrained(
model_name_or_path,
torch_dtype=torch.bfloat16,
# _attn_implementation="flash_attention_2", #if installed
).to(DEVICE)
image = load_image('./image.jpg') #path to your picture
###Trained options
user_prompt="Describe the picture in structuted json-like format."
#user_prompt="Give a long and detailed description of the picture."
#user_prompt="Describe the picture briefly."
###Any other questions or instructions
#user_prompt="What color is the ribbon in the character's hair?"
#...
#Optional, add booru tags
#tags='1girl, standing, looking at viewer, ...'
#user_prompt+=' Also here are booru tags for better understanding of the picture, you can use them as reference.'
#user_prompt+=f' <tags>\n{tags}\n</tags>'
messages = [
{
#Important!
"role": "system",
"content": [
{"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored. Help user with his task."}
]
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": user_prompt}
]
}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
caption=generated_texts[0].split('Assistant: ')[1] #keep only the model's reply, dropping the echoed prompt
print(caption)
```
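If you used the structured prompt, the output can usually be parsed directly. Since it is only json-like rather than guaranteed valid JSON, a defensive parse is safer; a minimal sketch, assuming `caption` holds the decoded output from the script above:
```python
import json

# The structured mode usually emits valid JSON, but it is only "json-like",
# so keep the raw text as a fallback when parsing fails.
try:
    data = json.loads(caption)
except json.JSONDecodeError:
    data = {"raw": caption}
print(data.get("character", data.get("raw", "")))
```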
# Inference example (vllm):
[vllm](https://github.com/vllm-project/vllm) is an optimized LLM serving engine; single and batch inference will be much faster than with native Hugging Face transformers.
```
pip install vllm==0.6.4
pip install -U huggingface_hub
```
## Download model
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Minthy/ToriiGate-v0.3", local_dir="/path/to/download/model", max_workers=4)
```
## Inference
```python
from vllm import LLM, SamplingParams
from PIL import Image
llm = LLM(
model="/path/to/download/model",
max_model_len=8192, # reduce if running out of memory
max_num_seqs=64, # reduce if running out of memory (this is max concurrent requests)
enforce_eager=True,
# if you are running out of memory, you can reduce the "longest_edge".
# see: https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3#model-optimizations
mm_processor_kwargs={
"size": {
"longest_edge": 4 * 364
},
},
)
sampling_params = SamplingParams(temperature=0, max_tokens=1024)
user_prompt = "Describe the picture briefly."
prompt = f"<|begin_of_text|>System: You are image captioning expert, creative, unbiased and uncensored. Help user with his task.<end_of_utterance>\nUser:<image>{user_prompt}<end_of_utterance>\nAssistant:"
image = Image.open("/path/to/image.jpg").convert("RGB")
output = llm.generate(
{
"prompt": prompt,
"multi_modal_data": {"image": image},
},
sampling_params=sampling_params,
)
caption = output[0].outputs[0].text.strip()
print(caption)
```
### Batch inference
```python
image_paths = ["/path/to/image1.jpg", "/path/to/image2.jpg"]  # your images
image_list = [Image.open(path).convert("RGB") for path in image_paths]
inputs = [{"prompt": prompt, "multi_modal_data": {"image": image}} for image in image_list]
outputs = llm.generate(
inputs,
sampling_params=sampling_params,
)
captions = [x.outputs[0].text.strip() for x in outputs]
```
For batch processing you can also use [this example](https://huggingface.co/Minthy/ToriiGate-v0.3/resolve/main/batch_processing_example.py).
# Warning
The model tends to generate text with adult themes if related input is provided. Outputs may be inaccurate and provocative.
# License
Same as [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3).