Could you make a one-cli-mod for the new version of joy-caption?

#1
by IceHibiki - opened

Thanks for this modified version of joy-caption! It’s quite useful when processing large datasets. I’m wondering if you could update it to include the new alpha two version. It has a lot of additional functions that further improve the caption quality.

alpha two version

Thanks for the request. I just noticed it now that you mentioned it! I didn't realize they had updated it.
I'll give it a try, of course.

I've gone ahead and made it for now. Not sure if it works properly yet.
https://huggingface.co/John6666/joy-caption-alpha-two-cli-mod

I just ran a test on the new script, but it returned an error. Could you check it when you have time? Thank you!
if ids[0][-1] == tokenizer.eos_token_id or ids[0][-1] == tokenizer.convert_tokens_to_ids("<|eot_id|>"):
~~~~~~^^^^
IndexError: index -1 is out of bounds for dimension 0 with size 0

It seems I forgot to adapt the newly added prompt-processing part, which doesn't support batch processing. I'm working on fixing it, but my GPU is so crappy that it's taking a long time to make sure everything works. I'm sure I'll be able to fix it eventually.

Okay, the fix is done. The script's new structure is no longer compatible with efficient batched generation, so I had no choice but to switch to processing the batch sequentially, one image at a time.
Still, Alpha One seems to perform better in terms of output stability...
I guess it is still an Alpha and not a stable version.
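To illustrate the sequential approach mentioned above, the per-image part of the loop now looks roughly like this (simplified; build_inputs is just a placeholder for the prompt-plus-image-embedding step, not the actual function name in the script):

for image_path in image_paths:  # hypothetical list of images in the batch
    # Placeholder step: tokenize the prompt and splice in the image-adapter embeddings.
    input_ids, inputs_embeds, attention_mask = build_inputs(image_path)
    generate_ids = text_model.generate(input_ids=input_ids, inputs_embeds=inputs_embeds,
                                       attention_mask=attention_mask,
                                       max_new_tokens=300, do_sample=True, suppress_tokens=None)
    # Drop the prompt tokens, then trim the trailing EOS / <|eot_id|> token.
    # Because only one sample is generated at a time, the check always sees a
    # non-empty sequence, which avoids the IndexError above.
    generate_ids = generate_ids[:, input_ids.shape[1]:]
    if generate_ids.shape[1] > 0 and generate_ids[0][-1] in (
            tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")):
        generate_ids = generate_ids[:, :-1]
    caption = tokenizer.decode(generate_ids[0], skip_special_tokens=False)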

Thank you for your work! The Python script worked! However, I encountered a minor issue when attempting to load the BF16 version of the unsloth/Meta-Llama-3.1-8B-Instruct model: the outputs did not align with the prompts provided. After replacing your code's loading module with the original loading module, the outputs were as expected. Once again, thank you!

Thanks. I'm having a hard time making sure it works...

Which is the original module?
I think you mean Wi-zz's.
Maybe it's because I have NF4 quantization enabled by default in my case. It cuts memory use to roughly a quarter, but the accuracy is a bit lower. If you pass the --bf16 option, it should load in bfloat16.
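For reference, the NF4 default is just the usual bitsandbytes 4-bit setup, roughly like this (the exact values in the script may differ slightly):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Standard bitsandbytes NF4 quantization: 4-bit weights instead of 16-bit,
# with bfloat16 as the compute dtype.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
text_model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,                       # the Llama 3.1 8B Instruct repo used by the script
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()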

Sorry for the confusion, it's the module from fancyfeast/joy-caption-alpha-two. The NF4 version works fine; the issue only occurs when I try to use the bfloat16 model. Maybe somehow the LoRA did not load properly? Here is an example:
The sample image is this: 102061763_p0.png

My prompt: Write a very long descriptive caption for this image in a formal tone. Do NOT mention any text that is in the image. Do NOT mention the image's resolution.

When I run the script (I did pass the --bf16 option), I got an incomplete caption:
"This digital illustration features a whimsical and endearing depiction of a young girl, dressed in a cat-themed hooded costume, cradling a large, plush cat-shaped plush toy in her arms. The girl's facial expression is serene and peaceful, with her eyes closed and a subtle blush on her cheeks, conveying a sense of contentment and tranquility.

The girl's costume, which resembles a cat onesie, is a delightful shade of beige with brown ears and a hood adorned with a green leaf, adding a touch of natural elegance to the overall design. Her short, silver-blonde hair is styled in a blunt, chin-length bob with bangs, framing her heart-shaped face and emphasizing her innocent and youthful appearance.

The plush toy, which is the focal point of the illustration, is a large, spherical cat with a distinctive diamond-shaped pattern on its front, surrounded by intricate, swirling designs in shades of brown and beige. The cat's face is depicted with large, round eyes and a subtle, enigmatic smile, giving the impression of a wise and gentle companion. A sprig of greenery is nestled in the cat's hood, mirroring the one on the girl's costume and reinforcing the connection between the two.

The background of the illustration is a vibrant and playful composition of pastel hues, featuring a mix of purple, pink, and orange geometric shapes, as well as stylized cat paw prints in a deep green color. The overall effect is one of whimsy "

When I replace your code with fancyfeast's:

def load_models():
    print("Loading CLIP")
    clip_processor = AutoProcessor.from_pretrained(CLIP_PATH)
    clip_model = AutoModel.from_pretrained(CLIP_PATH)
    clip_model = clip_model.vision_model

    if (CHECKPOINT_PATH / "clip_model.pt").exists():
        print("Loading VLM's custom vision model")
        checkpoint = torch.load(CHECKPOINT_PATH / "clip_model.pt", map_location='cpu', weights_only=False)
        checkpoint = {k.replace("_orig_mod.module.", ""): v for k, v in checkpoint.items()}
        clip_model.load_state_dict(checkpoint)
        del checkpoint

    clip_model.eval()
    clip_model.requires_grad_(False)
    clip_model.to("cuda")

    # Tokenizer
    print("Loading tokenizer")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=False)
    assert isinstance(tokenizer, PreTrainedTokenizer) or isinstance(tokenizer, PreTrainedTokenizerFast), f"Tokenizer is of type {type(tokenizer)}"

    # LLM
    print("Loading LLM")
    if (CHECKPOINT_PATH / "text_model").exists():
        print("Loading VLM's custom text model")
        text_model = AutoModelForCausalLM.from_pretrained(CHECKPOINT_PATH / "text_model", device_map=0, torch_dtype=torch.bfloat16)
    else:
        text_model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16).eval()

    # Image Adapter
    print("Loading image adapter 🖼️")
    image_adapter = ImageAdapter(clip_model.config.hidden_size, text_model.config.hidden_size, False, False, 38, False)
    image_adapter.load_state_dict(torch.load(CHECKPOINT_PATH / "image_adapter.pt", map_location="cpu", weights_only=False))
    image_adapter.eval().to("cuda")

    return clip_processor, clip_model, tokenizer, text_model, image_adapter

I got this caption:
"This is a digital drawing in an anime style featuring a chibi character, a small, cute, and stylized depiction typical in Japanese animation. The character is a young girl with short, light blonde hair and bangs that cover her forehead. She has fair skin and is wearing a hooded, tan-colored outfit with cat ears, suggesting a playful, animalistic theme. The hood has a leaf motif on the top, and the character's eyes are closed, giving her a peaceful or sleepy expression. Her cheeks are slightly flushed, adding to the innocent and cute appearance.

The character is holding a large, round, plush toy resembling a cat, which has a similar tan and brown color scheme. The plush toy has large, expressive eyes with a somewhat sad or tired look, and it has a leaf on its head, matching the leaf on the character's hood. The background is a soft, pastel mix of purple, pink, and beige with abstract shapes and green paw prints, enhancing the whimsical and playful atmosphere of the image. The overall composition is balanced and visually appealing, with a focus on the character and her plush toy."

And this is similar to the result I got from https://huggingface.co/spaces/fancyfeast/joy-caption-alpha-two

Thanks for the details! I think I know the cause thanks to you.

The only difference is this: device_map="auto".
The Zero GPU space has 40 GB of VRAM, so loading the whole BF16 model into CUDA is no problem there, but on a personal PC it's not so easy, and some tensors are better off offloaded to the CPU.

I think I can fix it by typing just 4 x 2 characters.

Anyway, I set device_map="auto" for BF16. Not all tensors will be on CUDA, so there could be some LoRA-related errors, but this should be better overall.
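In code, the BF16 branch becomes something like this (just illustrating the device_map change, not the whole loading function):

# Before: everything pinned to GPU 0, which assumes the whole BF16 model fits there.
# text_model = AutoModelForCausalLM.from_pretrained(
#     CHECKPOINT_PATH / "text_model", device_map=0, torch_dtype=torch.bfloat16)

# After: let accelerate decide the placement and offload layers to CPU RAM if needed.
text_model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT_PATH / "text_model", device_map="auto", torch_dtype=torch.bfloat16).eval()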

Okay... It seems I've identified the issue, though I don't fully understand why it occurs. The LLM and LoRA need to be loaded in the specific way fancyfeast's script does:

        text_model = AutoModelForCausalLM.from_pretrained(CHECKPOINT_PATH / "text_model", quantization_config=nf4_config, device_map=device, torch_dtype=torch.bfloat16).eval()  #for NF4
        text_model = AutoModelForCausalLM.from_pretrained(CHECKPOINT_PATH / "text_model", device_map=device, torch_dtype=torch.bfloat16).eval()  #for bf16

And I think it might relate to the adapter_config.json. The LoRA only seems to work properly with the "unsloth/Meta-Llama-3.1-8B-Instruct" model.
In this case, the NF4 mode consumed around 9200 MB of VRAM, while the bf16 mode consumed about 18700 MB of VRAM.

I think this has something to do with the HF specs. The README.md he put with the adapter isn't an instruction manual; its YAML metadata redirects to unsloth's model, so that is what actually gets loaded.
By the way, in Alpha One it was Meta's official Llama, but he changed it because the gated model was annoying.

The other thing is that it looks like the LoRA needs to be merged into the model rather than applied as an adapter. That should be fine as long as the necessary keys remain on the model side.
This is something I will try today, and I'm sure one of these approaches will work, since what the HF server does is essentially just a PEFT and Transformers operation anyway.
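The merge itself should just be the standard PEFT pattern, something along these lines (my assumption about the call sequence; the adapter folder here is the text_model directory that ships with the checkpoint):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model in BF16, attach the LoRA adapter, then bake the adapter
# weights into the base weights so nothing PEFT-specific remains at inference.
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct", device_map="auto", torch_dtype=torch.bfloat16)
lora_model = PeftModel.from_pretrained(base_model, CHECKPOINT_PATH / "text_model")
text_model = lora_model.merge_and_unload().eval()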

https://huggingface.co/spaces/John6666/joy-caption-pre-alpha-mod/
For the time being, I switched the GUI version's default to a loading function similar to the one HF's server presumably uses. When the demo starts, the model is loaded in BF16 and the LoRA is merged as well.

In the case of NF4, it will not work unless I first switch the model back to BF16, apply the LoRA, and then quantize to NF4. From what I found while searching, this is probably a bug in PEFT. Doing it that way hurts loading speed and memory consumption, so for now I simply turned LoRA off when loading in NF4.
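For the record, the workaround I mean would look roughly like this: merge at BF16, save the merged weights, and only then reload them quantized, which is exactly the extra cost I'd rather avoid (imports and nf4_config as in the earlier snippets):

# 1. Load the base model in plain BF16 (not 4-bit) and merge the LoRA into it.
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct", device_map="auto", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, CHECKPOINT_PATH / "text_model").merge_and_unload()

# 2. Save the merged weights and reload them with the NF4 quantization config.
merged.save_pretrained("joy-caption-merged-bf16")   # illustrative temporary path
text_model = AutoModelForCausalLM.from_pretrained(
    "joy-caption-merged-bf16", quantization_config=nf4_config,
    device_map="auto", torch_dtype=torch.bfloat16).eval()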
Maybe it's just my imagination, but I think NF4 gives better output when that same LoRA is not applied. It can't be...

If it works well in the GUI version, I will implement the same loading method for BF16 loading in the CLI version.

At any rate, I updated both CLI versions since the bugs don't seem to be enough to break the output.

Maybe it's just my imagination, but I think NF4 gives better output when that same LoRA is not applied. It can't be...

I kind of feel the same way. To me, the LoRA only seems to steer the output toward a specific format. It did indeed perform better without the LoRA.

Okay, so perhaps the LoRA is not currently used to add knowledge, but to stabilize the output format. It will probably be used to reinforce knowledge in the future, since Danbooru tags and the other things in the prompt options have to be taught before they reach a practical level. For Danbooru tags, though, I think it would be easier to just use WD Tagger in practice...

Anyway, at this point, it may be easier to find a better base model to get a better output.
Also, the 8B model derived from Llama 3.1 seems to have no problem applying LoRA.

The 3B model from Llama 3.2 has a different underlying structure even before the LoRA is applied, which caused an error in the ImageAdapter. When the author switches language models in the future, the ImageAdapter will probably be changed as well.
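My guess (an assumption, not something I've verified in detail) is that the mismatch shows up in the adapter's projection width, since the adapter is sized from the text model's hidden size:

# The adapter projects CLIP features into the LLM's embedding space, so its output
# width follows text_model.config.hidden_size: 4096 for Llama 3.1 8B, but 3072 for
# Llama 3.2 3B. The shipped image_adapter.pt was trained for the 8B shape, so loading
# it on a 3B base would fail with a size mismatch.
image_adapter = ImageAdapter(clip_model.config.hidden_size, text_model.config.hidden_size,
                             False, False, 38, False)
image_adapter.load_state_dict(torch.load(CHECKPOINT_PATH / "image_adapter.pt",
                                         map_location="cpu", weights_only=False))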
