Going the *other* way?

#2
by inflatebot - opened

VLMs are all the rage right now, and there's a lot of data out there that's made for training Stable Diffusion models with their captions in text format. If we go from tags to natural language, it could be possible to automatically convert that data into something usable for finetuning VLMs with.

Aside from the fact that I'm not a data scientist, I'm not even very familiar with LLM, being a neophyte who started with Command R+ and Animagine.

However, if the vast amount of data from Danbooru and Gelbooru and the output from WD Tagger were smoothly available in natural language, it would be easier than it is now to create animation-oriented natural language image generation models. (The only hard part is earning money for GPU rental for model training)

For non-animation applications, I think it would be faster to leave it to the JoyCaption author, but I'll do some experimenting as well.
https://huggingface.co/Wi-zz/joy-caption-pre-alpha
https://huggingface.co/spaces/fancyfeast/joy-caption-pre-alpha

From then on, I will use this place as a working memo.

It would be somewhat effective to load Danbooru tags directly into an existing LLM and let them compose the text, but it would be better to simply merge the output of WD Tagger with the output of JoyCaption or Florence2. (Several model authors have actually done so.)

If you want to get good output, it is preferable to first create a dictionary that converts Danbooru tags into ordinary English words, and then pass the list of words converted by a program using the dictionary to LLM.
So first, I looked around for a dataset that I could use, but apparently a simple correspondence type does not exist or cannot be found. I found, or rather knew about Japanese and Chinese, but when I tried to express them in English, the details were detailed but redundant.

So I may have to start by creating a dataset. I guess the order is: make a dataset, make a dictionary, make a program....
https://huggingface.co/datasets/isek-ai/danbooru-wiki-2024
https://huggingface.co/datasets/p1atdev/danbooru-ja-tag-pair-20240715

I realise I said "in text format" when what I meant is "in TAGS format". So what I'm looking for is basically this space, but flipped around, where a list of TAGS are given as input, and then it produces a short string caption based on those tags. But I realize, in hindsight, that without the context of the actual image, that wouldn't really work all that well. >.<

JoyCaption picked an insufficient text model for this task, IMO. L3.1 is too censored for the "not-for-all-audiences" nature of a lot of this data, and if you throw anything kinda niche at it, it falls apart. Maybe they could make something slightly more usable with some tag guidance, but we'll have to see, I suppose.

without the context of the actual image, that wouldn't really work all that well.

Yes, yes. There is no way to match the VLM or its hybrids in that regard. So half of the uses of this LLM will be like generating short stories. (NSFW specific! Otherwise, use GPT4o or Claude3 Opus.)

JoyCaption picked an insufficient text model for this task, IMO. L3.1 is too censored for the "not-for-all-audiences" nature of a lot of this data,

I see. That's important, but I didn't know that.
There's a big room for improvement that might be possible even for me.
Maybe it would be faster to modify JoyCaption for the power-up application of tagging by VLM.

Let's explore the possibilities. However, I've never tuned a VLM or LLM, so if I can't do anything about it by tinkering with the code, I plan to give up quickly.

As it turns out, after looking through the source code, modifying JoyCaption is probably easy.

It is the VLM part that is customized, and that can be diverted because the main part is the adapter. It is a smart design.
The LLM part is pure LLAMA3 8B. No customization. Well, it is a pre-alpha.
The VLM and LLM sections are almost completely separated.
If you just want to swap the LLAMA3 model, you only need to write one line.

I got the impression that many people do that locally and silently.
If there is a HF LLAMA3 series model repo with 4-bit quantization as per the bitsandbbytes manual, the local version should be able to be modified in the same way.

Of course, there are many issues, such as the need for fine parameter adjustment for each LLM model, and the fact that it is not HF's Serverless Inference, but a type that runs directly on the GPU of the space, so Zero GPU space is required and Quota is a concern.
But anyone who can program can modify it.
If he/she is familiar with chat templates, he/she should be able to use various models. (I don't know much about it, but I could at least look it up)

Well, I'll try to exchange within the same LLAMA3 format model first, since aiming too far from the start often leads to brittle results. Maybe tomorrow, since it's already late.

https://huggingface.co/spaces/John6666/joy-caption-pre-alpha-mod
So I quickly made a modified version, using Mr. Sao's model as default, but I wonder what the NSFW performance is. I also made it possible to add models as well as my other spaces, but the model adding operation is slow due to a bug related to the Zero GPU space. This bug seems to be hard to solve...

https://huggingface.co/spaces/KBlueLeaf/TITPOP-DEMO
https://huggingface.co/KBlueLeaf/TITPOP-200M-dev
KBlueLeaf's space is still in pre-alpha and the internals are undisclosed, but it appears to be largely about converting Danbooru tags to natural language.
If he makes it, all we have to do is wait for it to be completed, then copy and rearrange it as needed.

Sign up or log in to comment