metadata
datasets:
  - zer0int/CLIP-adversarial-typographic-attack_text-image
  - SPRIGHT-T2I/spright_coco
base_model:
  - BeichenZhang/LongCLIP-L
pipeline_tag: zero-shot-image-classification

Long-CLIP ViT-L/14 finetune: SAE-informed adversarial training


The original CLIP model accepts at most 77 input tokens, but its effective length is only about 20 tokens. See the original Long-CLIP paper for details. HunyuanVideo demo:

69 tokens, normal scene:

  • Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome. Lighting: Low-key with backlit silhouettes. Background: Gothic cathedral at night, stained glass windows breaking. Camera angle: Over the shoulder of a ninja, tracking her mid-air leap as she lands on a rooftop.

52 tokens, OOD (out-of-distribution) scene, showing superior consistency and prompt-following despite the OOD concept:

  • In this surreal nightmare documentary, a sizable spider with a human face is peacefully savoring her breakfast at a diner. The spider has a spider body, but a lady's face on the front, and regular human hands at the end of the spider legs.

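To relate the token counts quoted for the two prompts to the limits mentioned earlier, here is a minimal sketch of a prompt-length check. It is not part of this model's code. The 77-token limit and the ~20-token effective length come from the text above; the 248-token Long-CLIP limit is an assumption taken from the Long-CLIP paper and is not stated in this card. `whitespace_tokenize` and `check_prompt` are hypothetical helpers, and the whitespace split is only a stand-in: a real CLIP BPE tokenizer will usually report more tokens.

```python
from typing import Callable, Dict, List, Union

CLIP_MAX_TOKENS = 77        # stated in the text above
CLIP_EFFECTIVE_TOKENS = 20  # approximate, per the text above
LONG_CLIP_MAX_TOKENS = 248  # assumption: Long-CLIP paper, not stated in this card


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits on whitespace only."""
    return text.split()


def check_prompt(
    text: str,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    special_tokens: int = 2,  # assumption: start/end tokens count toward the limit
) -> Dict[str, Union[int, bool]]:
    """Count tokens in a prompt and compare against the context limits."""
    n = len(tokenize(text)) + special_tokens
    return {
        "tokens": n,
        "fits_clip": n <= CLIP_MAX_TOKENS,
        "beyond_effective_length": n > CLIP_EFFECTIVE_TOKENS,
        "fits_long_clip": n <= LONG_CLIP_MAX_TOKENS,
    }


if __name__ == "__main__":
    prompt = "Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome."
    print(check_prompt(prompt))
```

To get counts comparable to the ones quoted above, pass the tokenizer you actually use as `tokenize`; whether the quoted counts include the start and end tokens is not specified here, so adjust `special_tokens` accordingly.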