This demo is way better than fancyfeast's original one
Hello!
I tested both demos, and this one is way better than the original one from fancyfeast. I think this is because of the model used here: bunnycore/LLama-3.1-8B-Matrix.
I can confirm that this demo (running through a Gradio share link) works well on a Tesla T4 GPU with 16 GB VRAM. I wonder if there are other techniques/methods that could make it faster on that GPU (besides the quantization I already know about). This could really help ordinary people generate good captions for their image datasets to train/fine-tune Stable Diffusion-like models.
Is it possible to add an option for batch image processing? I tried with all the LLMs/chatbots like ChatGPT o1/Claude 3.5 Sonnet/Qwen 2.5, but they couldn't do it, and I'm no coder unfortunately!
Keep up the good work!
Never mind, I figured out how to modify the app to add a batch image processing function.
This still remains: I wonder if there are other techniques/methods that could make it faster on that GPU (besides the quantization I already know about).
Hello. It's just 6AM here. Sorry, I was sleeping.
This mod is basically a version where you can select another Llama 3 model.
I decided to make it in response to feedback lamenting that the original Llama 3 is strictly NSFW-regulated and that no other model could be used.
bunnycore's models are usually excellent.
As for the logic part, I simply added NF4 quantization to the original code...
Also I tried to include GGUF loading, but it was too incompatible with the current transformers and I gave up for now.
I was surprised to hear it was fast, because I didn't use any special speed-up techniques in the first place.
But in this Space, to avoid a Zero GPU Space bug, I load the model into RAM first and then move it to VRAM just before inference.
If we loaded everything into VRAM from the beginning, we might be able to speed things up somewhat. If I did that in my Zero GPU Space right now, though, I wouldn't be able to switch models after startup.
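For reference, a minimal sketch of that RAM-first pattern (an assumption on my part, not this Space's exact code; the checkpoint name is just an example):

# Sketch: keep the model in system RAM at startup and move it to VRAM only
# inside the @spaces.GPU-decorated function, as described above.
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bunnycore/LLama-3.1-8B-Matrix"  # example checkpoint from this thread
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)  # stays in RAM at startup

@spaces.GPU
def generate_caption(prompt: str) -> str:
    model.to("cuda")  # moved to VRAM just before inference
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)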
I forget whether it does NF4 quantization at runtime, but the following tool, which I also referred to, is effectively a pre-alpha meant to be used locally for batch processing, etc. I wonder if you could get the same or better performance by making an alpha-one version of it and adding runtime NF4 quantization?
Maybe this author will make one eventually, but I wouldn't mind doing it myself, at least modifying it to try it out. However, my GPU is so crappy that I can't test it locally.
https://huggingface.co/Wi-zz/joy-caption-pre-alpha
So here's how you can try everything. I made it work on Google Colab (on a Tesla T4 GPU with 16 GB VRAM) - you get 3-4 free hours per day on Google Colab.
Copy and paste these commands into a new Google Colab notebook (making sure you change the runtime to a T4 GPU, not CPU). I mention all this because I'm in the same situation and can't run these models locally due to outdated hardware.
Lines of code to copy into google colab:
!git clone https://huggingface.co/spaces/John6666/joy-caption-pre-alpha-mod
!pip install -r /content/joy-caption-pre-alpha-mod/pre-requirements.txt
!pip install -r /content/joy-caption-pre-alpha-mod/requirements.txt
!pip install gradio spaces
!huggingface-cli login
!python /content/joy-caption-pre-alpha-mod/app.py
Copy your HF token and paste it after running the "!huggingface-cli login" command. Before running app.py, go to https://huggingface.co/dominic1021/joycaption_alpha_one/blob/main/app.py and copy its code, replacing the contents of /content/joy-caption-pre-alpha-mod/app.py with the one from my HF repo (I integrated everything into a single app.py, so joycaption.py is no longer needed).
After that, just run "!python /content/joy-caption-pre-alpha-mod/app.py" and you'll get a Gradio share link after 2-5 minutes. Open it and voila -> here's the working app!
I tested it; the other methods load fine as well (unlike on the HF Space, where I got errors when trying to do that).
I'll have to find a way to add batch image processing, but I'm out for now. Can't believe I lost so much time getting everything to work in Colab with the help of LLMs!!!
They are so bad at helping you to code atm... Even with Cursor AI.
Peace, and keep up the good work! Hopefully the Google Colab method I've given will help you experiment daily in those 3-4 free hours provided by Google. :)
If you have more questions about Google Colab, don't hesitate to ask. I know pretty much all the important things about these cloud GPU providers.
Also, I'm still wondering if it's possible to implement more methods to optimize usage on 16 GB VRAM GPUs (besides quantization). I found some: Model Pruning, Gradient Checkpointing, Efficient Attention Mechanisms, Model Sharding / Parallelism, Dynamic Offloading, Activation Checkpointing, Flash Attention, Continuous Batching. I don't really know how to apply these methods or which ones are best; I just found them by researching. :)
I'll leave an image of the app working on Colab.
I made it work on Google Colab (on a Tesla T4 GPU with 16 GB VRAM) - you get 3-4 free hours per day on Google Colab.
Thanks for the useful information. I especially didn't know the specifics about the free hours, or that GPUs of that class can be used. There are so many possible uses with that much.
That said, I just created a CLI version. Don't know if it will work or not.
If you can, please just give it a quick test run.
https://huggingface.co/John6666/joy-caption-alpha-one-cli-mod
Everything is the same except for app.py and README.md.
Model Sharding / Parallelism,
The former is faster when the model is super huge and the latter in a multi-GPU environment, but not so much otherwise.
Dynamic Offloading,
This is also mainly for when the model is huge and VRAM is not enough.
Activation Checkpointing, Flash Attention
There was indeed a flash attention!
This could be useful, and it's easy to enable, but I actually don't know much about the details.😅
I'll give it a try later.
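For reference, a rough sketch of those two options in transformers (my own assumption, not this Space's actual code; the checkpoint name is just an example):

import torch
from transformers import AutoModelForCausalLM

# 1) CPU offloading when VRAM runs short: accelerate places layers across
#    GPU and CPU RAM automatically.
offloaded_model = AutoModelForCausalLM.from_pretrained(
    "bunnycore/LLama-3.1-8B-Matrix",  # example checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "14GiB", "cpu": "30GiB"},  # cap GPU 0 usage, spill the rest to RAM
)

# 2) Switching the attention implementation at load time. flash_attention_2 needs
#    the flash-attn package and a supported GPU; "sdpa" is the PyTorch-native alternative.
fast_attn_model = AutoModelForCausalLM.from_pretrained(
    "bunnycore/LLama-3.1-8B-Matrix",  # example checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # or "sdpa"
)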
Alright, gonna try this and leave a feedback afterwards.
By the way, it would be best if you edit the code so that "image_adapter.pt" can be found globally, so its directory no longer has to be inserted manually for the app to work. I've already done that here: https://huggingface.co/dominic1021/joycaption_alpha_one/blob/main/app.py
The error you get otherwise:
"Loading image adapter 🖼️
Error loading models: [Errno 2] No such file or directory: '9em124t2-499968/image_adapter.pt'"
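For reference, one simple way to do that is to resolve the path relative to app.py instead of the current working directory (just a sketch; the folder name is taken from the error above, and the actual fix in the repo may differ):

from pathlib import Path

# Resolve image_adapter.pt relative to this script, not the working directory.
CHECKPOINT_DIR = Path(__file__).resolve().parent / "9em124t2-499968"
IMAGE_ADAPTER_PATH = CHECKPOINT_DIR / "image_adapter.pt"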
Feedback: I tried it, and only this command seems to work:
python app.py image.jpg (process a single image)
Also, unsloth's model is not really good at uncensored things.
https://huggingface.co/dominic1021/joycaption_alpha_one/blob/main/app.py
I've tried to reflect this, but I'm not sure if this will work.
Thanks though, I appreciate it. I didn't know you could declare global variables side by side in Python; I really wanted to write it C-style... now I know I can do it!😭
I also made it Python 3.9 compliant. I don't remember why myself, but my local environment is still on 3.9, and it's too much of a pain to migrate libraries to 3.10 or 3.11.
Feedback: I tried it, and only this command seems to work:
python app.py image.jpg (process a single image)
Thank you so much.
Is any tag content missing from the output? Honestly, there are some portability difficulties, and it's a bit Frankensteinian.
Pre Alpha
images = clip_processor(images=batch, return_tensors='pt', padding=True).pixel_values.to('cuda')
Alpha One
all_images = []  # The part I added
for input_image in batch:  # added
    image = input_image.resize((384, 384), Image.LANCZOS)
    pixel_values = TVF.pil_to_tensor(image).unsqueeze(0) / 255.0
    pixel_values = TVF.normalize(pixel_values, [0.5], [0.5])
    all_images.append(TVF.to_pil_image(pixel_values.squeeze()))  # added
batch_pixel_values = clip_processor(images=all_images, return_tensors='pt', padding=True).pixel_values.to(device)  # added
You get this error when trying to use the batch processing command:
Traceback (most recent call last):
File "/content/joy-caption-alpha-one-cli-mod/app.py", line 369, in
main()
File "/content/joy-caption-alpha-one-cli-mod/app.py", line 355, in main
process_directory(input_path, output_path, batch_size, models)
File "/content/joy-caption-alpha-one-cli-mod/app.py", line 288, in process_directory
captions = stream_chat(batch_images, batch_size, pbar, models)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
TypeError: stream_chat() missing 3 required positional arguments: 'batch_size', 'pbar', and 'models'
I knew the cause and solution as soon as I saw it, and I was able to fix it almost entirely with the mouse.
I had a hard time dealing with the batch processing in the logic part, and I didn't realize I had forgotten that part...
If it works now, I may or may not add temperature and so on.
Will test the updated app.py soon.
I was wondering if there's any way to make this JoyCaption captioner work like SmilingWolf's tagging models. Those models can run even on CPU, and they are really fast and good at tagging/captioning images. I don't know how they all work, but I think he uses an LLM behind them after all, like JoyCaption uses Llama 3.1.
I'm a beginner into all this, so I don't even know if what I said actually makes any sense.
https://github.com/KichangKim/DeepDanbooru
I'm not familiar with NLP either, but the lightweight WD Tagger is a bit like an old VLM, and WD Tagger's ancestor was DeepDanbooru, which is less than 6 years old.
JoyCaption uses VLM plus LLM for text decoration, which is very popular right now. This one has both modifications.
I think models trained with both the CLI version of pre alpha and WD Tagger were already starting to appear.
That's the main use for the CLI version, so ComfyUI and WebUI users will join in the improvements and the environment will be ready soon.
JoyCaption is using Google's VLM for now, but looking at the extraordinary performance of Qwen2.5's LLM, he may switch when the 2.5 VLM comes out. Microsoft's Florence2 is also very nice, but it pales a bit in comparison.
Florence 2 is highly censored, and as for Qwen 2.5, I guess we can only use the vLLM locally with the quantized versions or the 7B one.
I don't know if you've heard about it, but Pixtral is a really good vLLM worth considering, with 12B parameters. It gives even more accurately detailed captions for some images than the actual JoyCaption, depending on what prompt you use as the instruction. I think we can use this after it gets quantized; I tried to quantize it myself with gguf-my-repo and it didn't work, because it doesn't have the usual format (tekken.json instead of config.json, and so on).
https://huggingface.co/spaces/aixsatoshi/Pixtral-12B
Florence2 is rather useless for NSFW purposes... I didn't know that.
Qwen is a bit looser on that point in general and the latest LLM is very good, so I had high hopes for it, but alas.
I've seen Pixtral's name. But even for those of us who are into GGUF quantization as a hobby, it looks like we're still waiting for Llamacpp support. So I haven't actually tried using it yet.
My LLM environment is practically HF only.
I just checked and the author of Pixtral is Japanese.
I don't know why there are so many Japanese in the tagger area...me too.
It's convenient for me, I guess.
It wasn't the author, it was the Space author. Well, either way, it's a good find. But tekken is probably the Japanese word for iron fist...
Mistral is promising.
AI-related Japanese language resources are scarce.
Yep, Pixtral is legit something else; I got it to write literally everything from an image (describing every single thing, from the character and hair to the background, art style, everything). Unfortunately the 12B model doesn't work on Google Colab with a T4 and 16 GB VRAM; it just runs out of VRAM/RAM, so I didn't have time to experiment with it. And the HF Spaces with Pixtral don't have batch processing, so it's really hard to test like this, having to manually drag and drop each image after the previous one is captioned.
I looked at the code to see if I could implement it a bit and turned back. No, this is indeed better to wait for official transformers support...😅
Llamacpp users must have reached a similar conclusion.
I wish I could get a Zero GPU Space slot for it, but I'm already at the edge of my 10 slots, so it's tough unless it can be structured to live in a Space with the others.
At least with official transformers support it would be easy to co-locate it with this one...
The Enterprise plan is also capped at 10 slots, which doesn't help much. I'm waiting for them to expand the individual plans.
https://huggingface.co/spaces/aixsatoshi/Pixtral-12B/blob/main/app.py
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, TextChunk, ImageURLChunk
from mistral_common.protocol.instruct.request import ChatCompletionRequest
https://huggingface.co/mistral-community/pixtral-12b/discussions/7/files
Oh, the transformers description was just merged. I wonder if this means it's on course to live together with JoyCaption.
Hopefully we can use this as soon as possible. Maybe with llama.cpp if not with gguf.
Tried it again and after running this command:
python /content/joy-caption-alpha-one-cli-mod/app.py /content/images
I got the following error:
Traceback (most recent call last):
File "/content/joy-caption-alpha-one-cli-mod/app.py", line 369, in
main()
File "/content/joy-caption-alpha-one-cli-mod/app.py", line 355, in main
process_directory(input_path, output_path, batch_size, models, caption_type, caption_tone, caption_length)
File "/content/joy-caption-alpha-one-cli-mod/app.py", line 288, in process_directory
captions = stream_chat(batch_images, batch_size, pbar, models, caption_type, caption_tone, caption_length)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/content/joy-caption-alpha-one-cli-mod/app.py", line 184, in stream_chat
clip_processor, clip_model, tokenizer, text_model, image_adapter = models
ValueError: not enough values to unpack (expected 5, got 3)
It still works fine with one image, but not with multiple images in a directory...
Hmm? What's this error? "This is awful," I thought to myself. An easy mistake.
I passed the positional arguments to the function in the wrong order. I fixed it. (hopefully)
I tried adding Pixtral to the model switching, as described in the repo, but found that it is not yet implemented in AutoModel in transformers and does not work, so I omitted it.
It's normal for repo descriptions to contain lies, plans, wishes, etc.
PIXTRAL_PATH = "mistral-community/pixtral-12b"
llm_models = {
"bunnycore/LLama-3.1-8B-Matrix": None,
#PIXTRAL_PATH: None,
"Sao10K/Llama-3.1-8B-Stheno-v3.4": None,
"unsloth/Meta-Llama-3.1-8B-bnb-4bit": None,
"DevQuasar/HermesNova-Llama-3.1-8B": None,
"mergekit-community/L3.1-Boshima-b-FIX": None,
"meta-llama/Meta-Llama-3.1-8B": None, # gated
}
I tried adding Pixtral to the model switching, as described in the repo, but found that it is not yet implemented in AutoModel in transformers and does not work, so I omitted it.
It's normal for repo descriptions to contain lies, plans, wishes, etc.
PIXTRAL_PATH = "mistral-community/pixtral-12b" llm_models = { "bunnycore/LLama-3.1-8B-Matrix": None, #PIXTRAL_PATH: None, "Sao10K/Llama-3.1-8B-Stheno-v3.4": None, "unsloth/Meta-Llama-3.1-8B-bnb-4bit": None, "DevQuasar/HermesNova-Llama-3.1-8B": None, "mergekit-community/L3.1-Boshima-b-FIX": None, "meta-llama/Meta-Llama-3.1-8B": None, # gated }
Yeah, unfortunately there are so many lies around these AI applications, just to build up fake hype...
I think Llama 3.1 8B Lexi Uncensored V2 is better than the unsloth one you're using atm.
https://llm.extractum.io/list/?uncensored
I think Llama 3.1 8B Lexi Uncensored V2 is better than the unsloth one you're using atm.
I set this one as the default value. The performance of this model has been confirmed in my Llamacpp space previously.
It seems to work well with JoyCaption's LoRA.
In the case of the CLI script, the default is the NF4 version, because file size may matter and the original author's intention is unknown.
Instead, I added a line in the comments.
There is an AWQ version, but I did not use it this time because I would have had to mess with its dependencies.
I could have quantized and uploaded the NF4 version myself, but I'm sure there would be better models out there than this eventually.
I committed the CLI version because there was nothing more to tweak for now.
Can you explain what your app.py code does, such that it makes JoyCaption work even on 16 GB of VRAM while fancyfeast's code gives out-of-memory errors?
I tested the code from this Space a few days ago and it worked without problems.
Is it because you made it load the shards of the vLLM in order? Or is there something else?
This is the one that doesn't work: https://huggingface.co/spaces/fancyfeast/joy-caption-alpha-one/blob/main/app.py -> someone told me it fails because it tries to load CLIP and the Llama LLM at the same time, or something like that.
Can you explain what your app.py code does, such that it makes JoyCaption work even on 16 GB of VRAM while fancyfeast's code gives out-of-memory errors?
In fancyfeast's code, whether you load a 4-bit quantized model or a normal one, it is deployed in bfloat16 in VRAM, so each weight takes up 16 bits of VRAM.
In my code, I do on-the-fly quantization with bitsandbytes when loading, so each weight only occupies 4 bits in VRAM. For the actual computation, it is set to be expanded back to bfloat16.
This is just a standard feature of bitsandbytes and transformers; Diffusers isn't supported yet, but it will be eventually.
Of course, the accuracy will be lower than if everything were done in bfloat16. But as you can see from the example of Flux, an image generation AI, the results are surprisingly hard to tell apart in practice.
https://huggingface.co/posts/bartowski/928757596721302
And since fancyfeast's code is written to run on a Zero GPU Space with 40 GB of VRAM, there is nothing wrong with bfloat16 there.
However, normal people don't have such a GPU, so it won't run due to lack of VRAM.
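In transformers terms, the on-the-fly NF4 loading described above boils down to something like this (a sketch, not the exact code of this Space; the model id is only an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bunnycore/LLama-3.1-8B-Matrix"  # example checkpoint

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # weights stored as 4-bit in VRAM
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # expanded back to bf16 for the actual computation
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,  # quantized on the fly while loading
    device_map="auto",
)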
Thanks for the explanation!
I studied this quantization method with transformers and bitsandbytes. It seems it can only be done once the libraries support the model (so you have to wait for the devs to implement some new things).
What about when a new LLM launches today and transformers and bitsandbytes don't support it yet? Is there any other way to quantize that model on its first release day, without waiting for the other tools to eventually get updated?
Or, a better question: is there a quantization method that can be applied out of the box to almost every LLM/vLLM on release day, without depending on others to add support for the newest models?
An example of this is Pixtral, which launched about two weeks ago, and only yesterday or today did transformers get support for it.
bitsandbytes is originally a separate library, and with it you should basically be able to quantize and dequantize any torch tensor.
For transformers, as long as you can load the model in NF4 with from_pretrained, you should only need one more line of save_pretrained. freeze(model) may be necessary, but that's about it.
If you want to do it manually with your own model or something, the method at the URLs below will also work for NF4. The sample is 8-bit, though.
https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/quantization
https://huggingface.co/blog/hf-bitsandbytes-integration
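Concretely, the load-then-save route looks roughly like this (a sketch, assuming the checkpoint is already supported by from_pretrained and that your transformers/bitsandbytes versions support 4-bit serialization; the model id and output path are just examples):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # example; any supported checkpoint works
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# The one extra line: persist the already-quantized weights (push_to_hub works the same way).
model.save_pretrained("./Meta-Llama-3.1-8B-nf4")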
Maybe, but it's just that it isn't merged into the main revision of transformers on GitHub yet; you can install the corresponding version by specifying the commit hash directly.
I don't think an official source would put something completely unimplemented in the description.
P.S.
It's in the HF instructions... which means it's merged into main. I honestly have no idea why it doesn't work. Maybe the version of transformers in my Space is out of date, pulled in by other library dependencies.
https://huggingface.co/docs/transformers/main/en/model_doc/pixtral
Cause identified. Terrible mistake. The branching when loading the model was buggy; specifically, I forgot to return.
Progress report: I successfully loaded Pixtral in its quantized state. It would be possible to save it as well, but I won't, because that's already been done.
No matter how I try to load the processor, it won't load; it just returns None without an error. How is such behavior possible when the base class is supposed to be inherited in the library...?
It's possible that the repo config is incorrect or that I'm making some other easy mistake, but in any case, the symptom is that the model can be loaded while only the processor cannot, and referring to the instance returns None.
Still, Llama 3.2 performs well, and with Instruct even the 3B model returns a response in line with the requested format. It will be worth switching to once NSFW support becomes available, since it is heavily regulated anyway.
I understand what you're saying, but I was wondering if there's any method/way to quantize a model immediately after its release.
Example:
"On 4 October 2024, Llama 3.5 will be launched at 7:00 AM. Is there any way to quantize it by 8:00 AM (for example) without having to wait for some libraries to support it? A quantization method that can be applied to an LLM immediately after launch. Or is this something for more advanced coders/devs?"
Progress report: I successfully loaded Pixtral in its quantized state. It would be possible to save it as well, but I won't, because that's already been done.
No matter how I try to load the processor, it won't load; it just returns None without an error. How is such behavior possible when the base class is supposed to be inherited in the library...?
It's possible that the repo config is incorrect or that I'm making some other easy mistake, but in any case, the symptom is that the model can be loaded while only the processor cannot, and referring to the instance returns None.
Still, Llama 3.2 performs well, and with Instruct even the 3B model returns a response in line with the requested format. It will be worth switching to once NSFW support becomes available, since it is heavily regulated anyway.
This code works well on Tesla T4:
from transformers import LlavaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from PIL import Image
import time
# Load the model
model_id = "SeanScripts/pixtral-12b-nf4"
model = LlavaForConditionalGeneration.from_pretrained(
model_id,
use_safetensors=True,
device_map="cuda:0"
)
# Load the processor
processor = AutoProcessor.from_pretrained(model_id)
# Caption a local image
IMG_URLS = [Image.open("test.png").convert("RGB")]
PROMPT = "[INST]Caption this image:\n[IMG][/INST]"
inputs = processor(images=IMG_URLS, text=PROMPT, return_tensors="pt").to("cuda")
prompt_tokens = len(inputs['input_ids'][0])
print(f"Prompt tokens: {prompt_tokens}")
t0 = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=512)
t1 = time.time()
total_time = t1 - t0
generated_tokens = len(generate_ids[0]) - prompt_tokens
time_per_token = generated_tokens/total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({time_per_token:.3f} tok/s)")
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
This code works well on Tesla T4:
So that's my mistake.🤢 But I won't know what I did wrong until I find out what it is.
If both of them were None, it would be a mistake in branching or declaration. What kind of situation makes only one of them None...?
Well, it's not a huge amount of code, so it's probably a trivial reason, as it turns out.
Cause identified.
A Zero GPU Space specific bug. It seems that in functions with @spaces decorators, the type that was defined in the global scope is used, and changes to the type afterwards are not reflected.
So if I load both at startup, I can use both, but the VRAM consumption is terrible.
I have been struggling with the Zero GPU space bug for a month now, but it may be an unseen bug.
I'll have to report this again.😅
Hi👋 Can you write up how I can run it on a local PC?
I copied the repository and installed the requirements, but I get a Zero GPU error.
I'll try to fix it. Commenting out spaces and @spaces is the quickest way, but it's a pain if it doesn't work as-is.
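As a rough sketch of that "comment out spaces" idea (an assumption, not necessarily how it will actually be fixed), a small stub like this keeps @spaces.GPU from breaking when the Zero GPU package isn't installed locally:

# Sketch: a no-op stand-in so `@spaces.GPU` (and `@spaces.GPU(duration=...)`)
# still works when the Zero GPU `spaces` package isn't available locally.
try:
    import spaces
except ImportError:
    class spaces:  # minimal local stand-in for the Zero GPU decorator
        @staticmethod
        def GPU(func=None, **kwargs):
            if callable(func):      # used as @spaces.GPU
                return func
            return lambda f: f      # used as @spaces.GPU(duration=...)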