Prompt Inversion (captioner/tagger upstream task)
Newly published work: SITTA: A Semantic Image-Text Alignment for Image Captioning
https://github.com/ml-jku/semantic-image-text-alignment
https://arxiv.org/pdf/2307.05591.pdf
It bridges CLIP's joint text-image space to the LLM's token embedding space, giving a vision LLM with no extra huge ViT; the mapping works as a lightweight Llama plugin, so it should be easy to adapt to the Danbooru database.
Reading notes:
- Language embedding: the CLIP image embedding is mapped into the LLM's token embedding space, where nearby vocabulary tokens are retrieved.
- Random permutations: the retrieved tokens have no inherent order, so they are shuffled into the prompt (see the sketch below).
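A minimal sketch of that retrieval idea, assuming a learned linear map W from CLIP space into the LLM token-embedding space; all names, shapes, and the random placeholder weights are illustrative, not SITTA's actual code:

```python
import torch

# Illustrative shapes only: d_clip for CLIP, d_llm / vocab for a Llama-like LLM.
d_clip, d_llm, vocab, top_k = 768, 4096, 32000, 8
W = torch.randn(d_llm, d_clip) / d_clip ** 0.5   # stand-in for the trained linear map
token_emb = torch.randn(vocab, d_llm)            # stand-in for frozen LLM token embeddings

img_emb = torch.randn(d_clip)                    # stand-in CLIP image embedding
proj = W @ img_emb                               # project into the token-embedding space

# Retrieve the top-k vocabulary tokens by cosine similarity.
sims = torch.nn.functional.cosine_similarity(proj.unsqueeze(0), token_emb, dim=-1)
top_ids = sims.topk(top_k).indices

# The retrieved tokens carry no order, so shuffle them into the prompt;
# several random permutations per image can be sampled in practice.
prompt_ids = top_ids[torch.randperm(top_k)]
print("prompt token ids:", prompt_ids.tolist())
```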
Related work on language embedding:
https://arxiv.org/pdf/2305.01278.pdf
Figure 5: LLM log-probabilities outperform CLIP/RoBERTa contextual embeddings and a CNN classification head (toy scoring sketch below).
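A toy version of the winning approach, scoring candidate labels by their log-probability under a causal LM; gpt2, the prompt, and the label set are stand-in assumptions, not the paper's setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")     # small stand-in for a larger LLM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def label_logprob(prompt: str, label: str) -> float:
    """Sum of log p(label tokens | prompt) under the causal LM."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    label_ids = tok(" " + label, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position i predict token i+1, so score only the label span.
    logprobs = logits.log_softmax(-1)[0, prompt_ids.shape[1] - 1 : -1]
    return logprobs.gather(1, label_ids[0].unsqueeze(1)).sum().item()

labels = ["cat", "dog", "car"]
scores = {l: label_logprob("A photo of a", l) for l in labels}
print(max(scores, key=scores.get))
```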
https://arxiv.org/pdf/2306.17842.pdf
The approach makes the LLM understand images by reusing the LLM's token embeddings (lexical tokens) as the VQ-VAE codebook, so quantized image codes are words the LLM already knows (sketch after this block).
B.2 LLM Prompting / In-context denoising
Limitation: the task-specific conditioning demo (a five-choice setup) is weak.
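A minimal sketch of that codebook idea, quantizing image features against frozen LLM token embeddings; the vocabulary size, width, and random placeholders are assumptions:

```python
import torch

vocab, d = 32000, 4096                     # assumed Llama-like vocab and width
codebook = torch.randn(vocab, d)           # stand-in for *frozen* LLM token embeddings

def quantize(feats: torch.Tensor) -> torch.Tensor:
    """Map encoder features (N, d) to the nearest LLM token ids (N,)."""
    dists = torch.cdist(feats, codebook)   # (N, vocab) L2 distances
    return dists.argmin(dim=-1)

img_feats = torch.randn(16, d)             # stand-in image-encoder output
ids = quantize(img_feats)
# Each id decodes to a real word/subword, so the frozen LLM can be prompted
# with the quantized image directly (e.g. the in-context denoising of Sec. B.2).
```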
https://arxiv.org/pdf/2304.05653.pdf
Its textual conditioning seems redundant given the bucket (size) and time embeddings used in SDXL (micro-conditioning sketch below).
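For reference, a sketch of the SDXL-style micro-conditioning this note compares against: size and crop scalars get the same sinusoidal embedding as the timestep and are added to the time embedding; the embedding dim and values here are assumptions:

```python
import math
import torch

def sinusoidal_emb(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Embed a batch of scalars with the same sinusoidal scheme as timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    ang = x[:, None].float() * freqs[None, :]
    return torch.cat([ang.sin(), ang.cos()], dim=-1)

# Original size and crop top-left, embedded like the timestep and concatenated;
# SDXL adds this vector to the time embedding, which is why a textual
# description of resolution would be redundant.
scalars = torch.tensor([1024.0, 1024.0, 0.0, 0.0])       # (h, w, crop_top, crop_left)
micro_cond = sinusoidal_emb(scalars).flatten()[None, :]  # (1, 4 * 256)
print(micro_cond.shape)
```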
https://arxiv.org/pdf/2302.14383.pdf
Linearly factored text embeddings, similar to the concatenation used in SDXL; the mapping uses a CoCa-style transformer (concat sketch below).
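And a sketch of the SDXL concatenation being referenced: the two text encoders' sequence features are joined along the channel axis to form the cross-attention context (batch and sequence shapes assumed):

```python
import torch

# SDXL concatenates the penultimate hidden states of its two text encoders
# along the channel axis: CLIP ViT-L (768) + OpenCLIP ViT-bigG (1280) = 2048.
seq_len = 77
h_vit_l = torch.randn(1, seq_len, 768)    # stand-in CLIP ViT-L features
h_bigg = torch.randn(1, seq_len, 1280)    # stand-in OpenCLIP bigG features
text_ctx = torch.cat([h_vit_l, h_bigg], dim=-1)
print(text_ctx.shape)                     # torch.Size([1, 77, 2048])
```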