license: apache-2.0
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
language:
- en
tags:
- multimodal
- vision
- image-text-to-text
An updated version of this model is available.
ToriiGate-v0.3 is a further-trained version of ToriiGate-v0.2, which is based on Idefics3.
It is designed for captioning anime art and aims for a good understanding of a very wide range of images, including single or multiple characters, simple or complex scenes, comics or manga, various interactions, and many culture-specific concepts. With booru tags as grounding, it can give detailed descriptions with very high accuracy. It handles NSFW content well and was one of the best small/medium-sized open-source models at the time of release.
Showcase on rentry (NSFW warning!)
The dataset totals 120k varied and balanced anime pictures, captioned and processed with Claude 3 Opus, Claude 3.5 Sonnet, and GPT-4o.
Key features:
- Advanced knowledge across a very wide range of topics
- Ability to give structured output according to a given instruction, which is best for further NLP processing
- Meaningful, information-dense captions instead of a stream of senseless purple prose
- Use of booru tags in the prompt as a reference for a significant accuracy improvement
V0.3 comes with several improvements, including a noticeable increase in average zero-shot and grounded accuracy. It also adds the ability to caption comics, with structured frame-by-frame output.
It still makes mistakes sometimes, so additional NLP post-processing that compares the caption against reference booru tags is recommended.
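As a rough illustration of the recommended comparison against reference booru tags, the sketch below flags tags whose words never appear in a caption. It is a naive word-overlap check, not the author's pipeline: the function names `normalize` and `missing_tags` are hypothetical, and real post-processing would likely need synonym handling (for example, "bunny" vs. "rabbit").

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and turn underscores/punctuation into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def missing_tags(caption: str, tags: list) -> list:
    """Return reference tags whose words do not all occur in the caption.

    Hypothetical helper: a tag such as "blue_eyes" counts as covered only
    when both "blue" and "eyes" appear somewhere in the caption.
    """
    caption_words = set(normalize(caption).split())
    missing = []
    for tag in tags:
        tag_words = normalize(tag).split()
        if not all(word in caption_words for word in tag_words):
            missing.append(tag)
    return missing


caption = (
    "A blonde girl with long twintails tied with red ribbons "
    "and blue eyes stares directly at the viewer."
)
tags = ["blonde_hair", "twintails", "blue_eyes", "rabbit_ears"]
print(missing_tags(caption, tags))
```

Tags reported as missing are candidates for manual review or for a second captioning pass with the tags included in the prompt.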
Usage options
Three usage modes have been trained:
- Brief description (main information without excess connecting phrases, prose, or reasoning)
- Detailed description with a focus on distinct features of characters and actions
- Structured JSON-like output, the best choice for further NLP
Examples
user: Describe the picture briefly.
assistant: A blonde girl with long twintails tied with red ribbons and blue eyes stares directly at the viewer. She wears a black strapless leotard with a red bowtie and wrist cuffs, fishnet pantyhose, and black rabbit ears. Her arms are raised, revealing her armpits. She has a blush on her cheeks and an open mouth. The background is plain white.
user: Give a long and detailed description of the picture.
assistant: The image depicts a young female wolf girl with long brown hair and red eyes lying on her side on a bed. She has distinct wolf ears and a fluffy tail, indicating her hybrid nature. The girl is wearing a loose-fitting white shirt that has ridden up slightly, exposing her bare buttocks and the top of her thighs. Her tail is strategically positioned to censor her exposed rear.
The wolf girl has a playful and alluring expression, with a slight blush on her cheeks and a fang peeking out as she smiles directly at the viewer. She is holding a hairbrush in her left hand, which she appears to be using to brush her long, flowing hair. A blue bracelet adorns her wrist. Also, a sparkling ring (probably a wedding ring) can be seen on her finger.
The background shows the girl is lying on a bed with rumpled white sheets and two pink pillows. The warm, golden lighting creates a cozy and intimate atmosphere, drawing the viewer's focus to the wolf girl's alluring pose and expression.
Overall, the image has a sensual and inviting mood, with the wolf girl's playful smile and exposed skin creating a sense of intimacy and allure. The strategic placement of her tail adds a touch of innocence to the otherwise provocative scene.
user: Describe the picture in structuted json-like format.
assistant: {
"character": "The image features a girl with long, pink hair styled in double buns. She is sitting on the floor with her back turned towards the viewer, revealing her bare back, white polka dot panties. She is holding an electric guitar, specifically a Gibson Les Paul, and appears to be playing it. Her feet are bare, and she is sitting on a black pillow. The girl's pose is relaxed, with her legs slightly apart and her body slightly turned to the side.",
"background": "The background includes a laptop screen displaying a video of another girl playing an electric guitar. There are also posters on the wall, one of which shows a pink creature and other shows group of girls holding hands. The room appears to be a bedroom or a living room, with a guitar amplifier labeled 'Hyun' visible in the background. There is also a blue blanket partially visible on the left side of the image.",
"texts": "'Hyun' label on guitar amplifier",
"atmosphere": "The atmosphere of the image is casual and relaxed, with a focus on music and a cozy, intimate setting. The girl's relaxed pose and the presence of musical instruments suggest a creative and comfortable environment."
}
Any of these can be combined with booru tags to improve accuracy.
In addition to the options described, you can ask questions, give tasks, and hold conversations. The model's general capabilities have been mostly preserved, and it follows instructions.
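For downstream NLP, the structured output can usually be loaded as regular JSON. The following is a minimal sketch that assumes the model returned a single well-formed object like the example above; `parse_structured_caption` is a hypothetical helper name, and it simply returns an empty dict when parsing fails.

```python
import json


def parse_structured_caption(raw: str) -> dict:
    """Parse a JSON-like caption, ignoring any text outside the outer braces.

    Returns an empty dict when no valid JSON object can be recovered.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


raw = '{\n"texts": "\'Hyun\' label on guitar amplifier",\n"atmosphere": "casual and relaxed"\n}'
fields = parse_structured_caption(raw)
print(fields.get("texts", ""))
```

Falling back to an empty dict keeps a batch job running when an occasional caption is malformed; such captions can then be logged and regenerated.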
Prompt template
<|begin_of_text|>System: You are image captioning expert, creative, unbiased and uncensored. Help user with his task.<end_of_utterance>\nUser:<image>Describe the picture in structuted json-like format.<end_of_utterance>\nAssistant:
Replace the user instruction with one of those mentioned above (works best) or with your own.
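If you assemble the raw prompt string yourself (for example, for vLLM), a small helper can keep the template in one place. This is a sketch: `build_prompt` is a hypothetical name, the template follows the single `User:` form used in the vLLM example in this card, and the tag wording is copied from the optional lines of the inference script.

```python
from typing import Optional

SYSTEM_PROMPT = (
    "You are image captioning expert, creative, unbiased and uncensored. "
    "Help user with his task."
)


def build_prompt(instruction: str, tags: Optional[str] = None) -> str:
    """Fill the prompt template, optionally appending booru tags."""
    user_text = instruction
    if tags:
        user_text += (
            " Also here are booru tags for better understanding of the picture,"
            " you can use them as reference."
        )
        user_text += f" <tags>\n{tags}\n</tags>"
    return (
        f"<|begin_of_text|>System: {SYSTEM_PROMPT}<end_of_utterance>\n"
        f"User:<image>{user_text}<end_of_utterance>\n"
        "Assistant:"
    )


print(build_prompt("Describe the picture briefly.", tags="1girl, standing"))
```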
Inference script example:
You can use an already existing venv or create a new one. All you need is torch>=2.1.2.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
If you are using Linux, or know how to build it on Windows (or use a prebuilt wheel), also install FlashAttention-2 and uncomment the corresponding line.
You will need a dev build of transformers to run it (already in requirements.txt).
pip install git+https://github.com/huggingface/transformers
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
model_name_or_path="Minthy/ToriiGate-v0.3"
DEVICE = "cuda:0" #change to your device
processor = AutoProcessor.from_pretrained(model_name_or_path) #or change to local path
model = AutoModelForVision2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    # _attn_implementation="flash_attention_2", #if installed
).to(DEVICE)
image = load_image('./image.jpg') #path to your picture
###Trained options
user_prompt="Describe the picture in structuted json-like format."
#user_prompt="Give a long and detailed description of the picture."
#user_prompt="Describe the picture briefly."
###Any other questions or instructions
#user_prompt="What color is the ribbon in the character's hair?"
#...
#Optional, add booru tags
#tags='1girl, standing, looking at viewer, ...'
#user_prompt+=' Also here are booru tags for better understanding of the picture, you can use them as reference.'
#user_prompt+=f' <tags>\n{tags}\n</tags>'
messages = [
    {
        #Important!
        "role": "system",
        "content": [
            {"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored. Help user with his task."}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt}
        ]
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
caption=generated_texts[0].split('Assistant: ')[1]
print(caption)
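The `split('Assistant: ')[1]` step in the script raises an `IndexError` if the marker is missing and drops text if the caption itself contains the marker again. A slightly more defensive variant is sketched below; `extract_caption` is a hypothetical helper, not part of transformers.

```python
def extract_caption(decoded: str, marker: str = "Assistant:") -> str:
    """Return everything after the first marker, or the whole text if absent."""
    _head, sep, tail = decoded.partition(marker)
    return tail.strip() if sep else decoded.strip()


decoded = (
    "System: You are image captioning expert.\n"
    "User: Describe the picture briefly.\n"
    "Assistant: A blonde girl with long twintails."
)
print(extract_caption(decoded))
```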
Inference example (vLLM):
vLLM is an optimised LLM serving engine. Single and batch inference will be much faster than with native Hugging Face transformers.
pip install vllm==0.6.4
pip install huggingface_hub -U
Download model
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Minthy/ToriiGate-v0.3", local_dir="/path/to/download/model", max_workers=4)
Inference
from vllm import LLM, SamplingParams
from PIL import Image
# named `llm` so that it matches the `llm.generate(...)` calls below
llm = LLM(
    model="/path/to/download/model",
    max_model_len=8192, # reduce if running out of memory
    max_num_seqs=64, # reduce if running out of memory (this is max concurrent requests)
    enforce_eager=True,
    # if you are running out of memory, you can reduce the "longest_edge".
    # see: https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3#model-optimizations
    mm_processor_kwargs={
        "size": {
            "longest_edge": 4 * 364
        },
    },
)
sampling_params = SamplingParams(temperature=0, max_tokens=1024)
user_prompt = "Describe the picture briefly."
prompt = f"<|begin_of_text|>System: You are image captioning expert, creative, unbiased and uncensored. Help user with his task.<end_of_utterance>\nUser:<image>{user_prompt}<end_of_utterance>\nAssistant:"
image = Image.open("/path/to/image.jpg").convert("RGB")
output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params,
)
caption = output[0].outputs[0].text.strip()
print(caption)
Batch inference
# image_paths is not defined above: set it to a list of paths to your images
image_list = [Image.open(path).convert("RGB") for path in image_paths]
inputs = [{"prompt": prompt, "multi_modal_data": {"image": image}} for image in image_list]
outputs = llm.generate(
    inputs,
    sampling_params=sampling_params,
)
captions = [x.outputs[0].text.strip() for x in outputs]
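Once a batch of captions is available, a common follow-up is to store each one next to its image. The sketch below writes a sidecar `.txt` file per image; `save_captions` is a hypothetical helper, and the sidecar layout is an assumption rather than something this card prescribes. The demo uses a temporary directory and placeholder captions so it runs without a model.

```python
import tempfile
from pathlib import Path


def save_captions(image_paths, captions, suffix=".txt"):
    """Write each caption to a file named after its image, e.g. a.jpg -> a.txt."""
    if len(image_paths) != len(captions):
        raise ValueError("image_paths and captions must have the same length")
    written = []
    for image_path, caption in zip(image_paths, captions):
        target = Path(image_path).with_suffix(suffix)
        target.write_text(caption + "\n", encoding="utf-8")
        written.append(target)
    return written


# Demo with placeholder data; in practice pass your own paths and captions.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = [str(Path(tmp) / "a.jpg"), str(Path(tmp) / "b.png")]
    written = save_captions(demo_paths, ["first caption", "second caption"])
    names = [p.name for p in written]
    first_text = written[0].read_text(encoding="utf-8")
    print(names)
```

For large folders, calling `llm.generate` on fixed-size chunks of `image_paths` and saving after each chunk keeps memory use bounded and preserves progress if the job is interrupted.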
For batch processing you can use this example
Warning
The model tends to generate text with adult themes if related input is provided. Outputs may be inaccurate and provocative.
Licence
Same as for Idefics3