Prompt Inversion (captioner/tagger upstream task)
Newly published work: SITTA: A Semantic Image-Text Alignment for Image Captioning
https://github.com/ml-jku/semantic-image-text-alignment
https://arxiv.org/pdf/2307.05591.pdf
It bridges CLIP's joint text-image space to the LLM's token embedding space, giving a vision LLM with no extra huge ViT; the mapping works as a lightweight Llama plugin, so it should be easy to adapt to the Danbooru database.
Reading notes:
- Language embedding: the CLIP image embedding is mapped into the LLM's token embedding space, where nearby vocabulary tokens are retrieved.
- Random permutations: the retrieved tokens have no inherent order, so they are shuffled into the prompt (see the sketch below).
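A minimal sketch of that retrieval idea, assuming a learned linear map W from CLIP space into the LLM token-embedding space; all names, shapes, and the random placeholder weights are illustrative, not SITTA's actual code:

```python
import torch

# Illustrative shapes only: d_clip for CLIP, d_llm / vocab for a Llama-like LLM.
d_clip, d_llm, vocab, top_k = 768, 4096, 32000, 8
W = torch.randn(d_llm, d_clip) / d_clip ** 0.5   # stand-in for the trained linear map
token_emb = torch.randn(vocab, d_llm)            # stand-in for frozen LLM token embeddings

img_emb = torch.randn(d_clip)                    # stand-in CLIP image embedding
proj = W @ img_emb                               # project into the token-embedding space

# Retrieve the top-k vocabulary tokens by cosine similarity.
sims = torch.nn.functional.cosine_similarity(proj.unsqueeze(0), token_emb, dim=-1)
top_ids = sims.topk(top_k).indices

# The retrieved tokens carry no order, so shuffle them into the prompt;
# several random permutations per image can be sampled in practice.
prompt_ids = top_ids[torch.randperm(top_k)]
print("prompt token ids:", prompt_ids.tolist())
```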
Related work on language embedding:
https://arxiv.org/pdf/2305.01278.pdf
Figure 5: LLM log-probabilities outperform CLIP/RoBERTa contextual embeddings and a CNN classification head (toy scoring sketch below).
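A toy version of the winning approach, scoring candidate labels by their log-probability under a causal LM; gpt2, the prompt, and the label set are stand-in assumptions, not the paper's setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")     # small stand-in for a larger LLM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def label_logprob(prompt: str, label: str) -> float:
    """Sum of log p(label tokens | prompt) under the causal LM."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    label_ids = tok(" " + label, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position i predict token i+1, so score only the label span.
    logprobs = logits.log_softmax(-1)[0, prompt_ids.shape[1] - 1 : -1]
    return logprobs.gather(1, label_ids[0].unsqueeze(1)).sum().item()

labels = ["cat", "dog", "car"]
scores = {l: label_logprob("A photo of a", l) for l in labels}
print(max(scores, key=scores.get))
```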
https://arxiv.org/pdf/2306.17842.pdf
The approach makes the LLM understand images by reusing the LLM's token embeddings (lexical tokens) as the VQ-VAE codebook, so quantized image codes are words the LLM already knows (sketch after this block).
B.2 LLM Prompting / In-context denoising
Limitation: the task-specific conditioning demo (a five-choice setup) is weak.
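A minimal sketch of that codebook idea, quantizing image features against frozen LLM token embeddings; the vocabulary size, width, and random placeholders are assumptions:

```python
import torch

vocab, d = 32000, 4096                     # assumed Llama-like vocab and width
codebook = torch.randn(vocab, d)           # stand-in for *frozen* LLM token embeddings

def quantize(feats: torch.Tensor) -> torch.Tensor:
    """Map encoder features (N, d) to the nearest LLM token ids (N,)."""
    dists = torch.cdist(feats, codebook)   # (N, vocab) L2 distances
    return dists.argmin(dim=-1)

img_feats = torch.randn(16, d)             # stand-in image-encoder output
ids = quantize(img_feats)
# Each id decodes to a real word/subword, so the frozen LLM can be prompted
# with the quantized image directly (e.g. the in-context denoising of Sec. B.2).
```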
https://arxiv.org/pdf/2304.05653.pdf
Its textual conditioning seems redundant given the bucket (size) and time embeddings used in SDXL (micro-conditioning sketch below).
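For reference, a sketch of the SDXL-style micro-conditioning this note compares against: size and crop scalars get the same sinusoidal embedding as the timestep and are added to the time embedding; the embedding dim and values here are assumptions:

```python
import math
import torch

def sinusoidal_emb(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Embed a batch of scalars with the same sinusoidal scheme as timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    ang = x[:, None].float() * freqs[None, :]
    return torch.cat([ang.sin(), ang.cos()], dim=-1)

# Original size and crop top-left, embedded like the timestep and concatenated;
# SDXL adds this vector to the time embedding, which is why a textual
# description of resolution would be redundant.
scalars = torch.tensor([1024.0, 1024.0, 0.0, 0.0])       # (h, w, crop_top, crop_left)
micro_cond = sinusoidal_emb(scalars).flatten()[None, :]  # (1, 4 * 256)
print(micro_cond.shape)
```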
https://arxiv.org/pdf/2302.14383.pdf
Linearly factored text embeddings, similar to the concatenation used in SDXL; the mapping uses a CoCa-style transformer (concat sketch below).
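And a sketch of the SDXL concatenation being referenced: the two text encoders' sequence features are joined along the channel axis to form the cross-attention context (batch and sequence shapes assumed):

```python
import torch

# SDXL concatenates the penultimate hidden states of its two text encoders
# along the channel axis: CLIP ViT-L (768) + OpenCLIP ViT-bigG (1280) = 2048.
seq_len = 77
h_vit_l = torch.randn(1, seq_len, 768)    # stand-in CLIP ViT-L features
h_bigg = torch.randn(1, seq_len, 1280)    # stand-in OpenCLIP bigG features
text_ctx = torch.cat([h_vit_l, h_bigg], dim=-1)
print(text_ctx.shape)                     # torch.Size([1, 77, 2048])
```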