"Finally working: Redundant TEXT model for HF inference". Could you do the same thing for this LongCLIP?

#3
by kk3dmax - opened

I want to use this model in diffusers.
Many thanks in advance.

Done. πŸ‘
I'll just copy-paste what I just added to the readme.md:

🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: πŸ‘€

from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

❌ Loading fails with a size-mismatch error, because the Transformers library defines 77 tokens for CLIP by default

πŸ‘‡

Option 1 (simple & worse):

# Truncate to 77 tokens:
model = CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)

# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') πŸ“‰
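For readers wondering where tensors like the one above come from: CLIP compares an image embedding against text embeddings by cosine similarity. A toy sketch with random vectors (not the actual model outputs):

```python
import torch

torch.manual_seed(0)
image_emb = torch.randn(1, 768)   # stand-in for one image embedding
text_embs = torch.randn(4, 768)   # stand-ins for the four caption embeddings

# Normalize, then take dot products: that's cosine similarity
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
sims = image_emb @ text_embs.T    # shape (1, 4), same layout as the tensor above
```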

πŸ‘‡

Option 2 (edit Transformers) πŸ’– RECOMMENDED πŸ’–:

  • πŸ‘‰ Find the line that says max_position_embeddings=77, in [System Python]/site-packages/transformers/models/clip/configuration_clip.py
  • πŸ‘‰ Change to: max_position_embeddings=248,

Now, in your inference code, for text:

  • text_input = processor([your-prompt-or-prompts-as-usual], padding="max_length", max_length=248)
  • or:
  • text_input = processor([your-prompt-or-prompts-as-usual], padding=True)
# Resulting Cosine Similarities for 248 tokens padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') βœ…
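The difference between the two processor calls above is just the padding strategy. A minimal pure-Python sketch of the two behaviors (PAD_ID and the fixed length of 248 are illustrative; the real tokenizer handles this internally):

```python
PAD_ID = 0  # illustrative pad token id

def pad_max_length(ids, max_length=248):
    """padding="max_length": always pad out to the fixed max_length."""
    return ids + [PAD_ID] * (max_length - len(ids))

def pad_longest(batch):
    """padding=True: pad every sequence to the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    return [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]
```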

thank you so much!

Still got the error message below for FluxPipeline in diffusers.
Token indices sequence length is longer than the specified maximum sequence length for this model (327 > 248). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens:


It may be related to this diffusers code:

removed_text = tokenizer.batch_decode(untruncated_ids[:, tokenizer.model_max_length - 1 : -1])
logger.warning(
    "The following part of your input was truncated because CLIP can only handle sequences up to"
    f" {tokenizer.model_max_length} tokens: {removed_text}"
)
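The warning fires whenever the untruncated ids are longer than the tokenizer's model_max_length. A simplified single-sequence sketch of that check (the function name is mine, not diffusers'):

```python
def truncated_part(token_ids, model_max_length):
    """Return the ids diffusers would report as truncated (simplified).

    Mirrors untruncated_ids[:, model_max_length - 1 : -1] for one sequence;
    returns an empty list when nothing was cut off.
    """
    if len(token_ids) > model_max_length:
        return token_ids[model_max_length - 1 : -1]
    return []
```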


However, even after forcing clip_processor.tokenizer.model_max_length = 248, I still get the error message above.
model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

clip_model = CLIPModel.from_pretrained(model_id)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)
clip_processor.tokenizer.model_max_length = 248
pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder

Post here, just in case someone could find a workaround.
I'll try to figure out myself.

...And while I unfortunately don't have the time to write all the Forge, Diffusers, and GGUF pipeline implementations I've received questions about myself, I'm just gonna add this link to ComfyUI nodes for Flux.1.
You could reverse engineer the implementation and apply it to your code: https://github.com/SeaArtLab/ComfyUI-Long-CLIP

I have succeeded in making this LongCLIP work with Diffusers.

See the message:
Token indices sequence length is longer than the specified maximum sequence length for this model (307 > 248). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 248 tokens:

import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
dtype = torch.bfloat16

config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248
clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder
pipe.tokenizer_max_length = 248
pipe.text_encoder.to(torch.bfloat16)  # use .to(); assigning to .dtype does not convert the weights
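One PyTorch detail worth knowing here: assigning to .dtype on a module only sets a plain Python attribute, while .to(dtype) actually converts the parameters. A quick self-contained demonstration:

```python
import torch

layer = torch.nn.Linear(2, 2)          # parameters start as float32
layer.dtype = torch.bfloat16           # just sets an attribute; weights unchanged
assert layer.weight.dtype == torch.float32
layer.to(torch.bfloat16)               # this converts the parameters in place
assert layer.weight.dtype == torch.bfloat16
```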

With the code above, you don't need to hack Transformers --> Option 3 (the code above, for diffusers) πŸ’– RECOMMENDED πŸ’–:

Thank you for sharing this! I just updated the README.MD with this information. πŸ’– RECOMMENDED πŸ’–
πŸ˜πŸ‘

PS: And I just added the original author's LongCLIP for Diffusers today, too, if you're interested: https://huggingface.co/zer0int/LongCLIP-L-Diffusers
