Prompt Inversion (captioner/tagger upstream task)

by ke99L - opened

Published today: SITTA: A Semantic Image-Text Alignment for Image Captioning
It joins CLIP's text-image space to LLMs for vision, with no extra large ViT. It works as a LLaMA plugin and should be easy to adapt to the Danbooru database.
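
To make the alignment idea concrete, here is a minimal sketch of fitting a linear map from one embedding space into another by least squares, then looking up the nearest token in the target space. This is an illustration under assumptions, not SITTA's actual code: the dimensions, the random stand-in embeddings, and the names `fit_linear_map` and `nearest_tokens` are all hypothetical.

```python
import numpy as np

def fit_linear_map(src, dst):
    """Least-squares W such that src @ W approximates dst."""
    W, *_ = np.linalg.lstsq(src, dst, rcond=None)
    return W

def nearest_tokens(query, token_emb, k=3):
    """Indices of the k rows of token_emb most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    return np.argsort(-(t @ q))[:k]

# Toy paired data (assumption): each row is the same token seen in both spaces.
rng = np.random.default_rng(0)
clip_dim, llm_dim, vocab = 8, 16, 50
clip_tok = rng.normal(size=(vocab, clip_dim))        # stand-in for CLIP-side embeddings
true_map = rng.normal(size=(clip_dim, llm_dim))
token_emb = clip_tok @ true_map + 0.01 * rng.normal(size=(vocab, llm_dim))  # stand-in for LLM token embeddings

W = fit_linear_map(clip_tok, token_emb)

# Project a CLIP-side vector into the LLM space and read off the closest token id.
best = nearest_tokens(clip_tok[7] @ W, token_emb, k=1)
```

In a real setup the projected vector would be an image embedding, and the retrieved tokens would go into the LLM prompt. Only the small matrix `W` is fitted, which is what would keep such a plugin cheap.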

reading note:
Language Embedding
random permutations

Language Embedding related work:

Figure 5: LLM logprobs do better than CLIP, RoBERTa contextual embeddings, and a CNN classification head

The approach is to have the LLM understand the image: the LLM's token embeddings (lexical tokens) are mapped to the VQ-VAE codebook
B.2 LLM Prompting / In-context denoising
Limitations: the five-choice demo under task-specific conditions is weak
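
The note above about mapping LLM token embeddings to a VQ-VAE codebook can be sketched as a plain vector-quantization lookup in which the codebook is a frozen table of token embeddings, so every encoder feature is replaced by the id of its nearest lexical token. This is a toy sketch under assumptions: the four-word vocabulary, the 2-D embeddings, and the function name `quantize_to_tokens` are made up for illustration. A real system would use the LLM's own embedding table and a trained image encoder.

```python
import numpy as np

def quantize_to_tokens(features, codebook):
    """For each feature row, return the index of the nearest codebook row (squared L2)."""
    # Broadcast (n, 1, d) against (1, v, d) to get an (n, v) distance matrix.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Hypothetical frozen "token embedding" codebook, one row per lexical token.
vocab = ["cat", "dog", "sky", "tree"]
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

# Hypothetical encoder outputs for three image patches.
features = np.array([[0.9, 0.1], [-0.8, -0.1], [0.1, -0.7]])

ids = quantize_to_tokens(features, codebook)
tokens = [vocab[i] for i in ids]
print(tokens)
```

Because the codes are ordinary vocabulary entries, the resulting token sequence can be placed directly in an LLM prompt.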

The textual conditioning is redundant with the bucket and time embeddings used in SDXL

Linearly factored text embeddings: SDXL uses concatenation, CoCa uses a transformer mapping

ke99L changed discussion status to closed
