MultiBooth: Towards Generating All Your Concepts in an Image from Text
arXiv: 2404.14239
Multi-modal Concept Extraction
Note The QFormer encoder E takes three types of inputs: visual embeddings ξ of an image, a text description l, and learnable query tokens W = [w1, · · · , wK], where K is the number of query tokens. The QFormer outputs tokens O = [o1, · · · , oK] with the same dimensions as the input query tokens.
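To make the interface concrete, here is a minimal PyTorch sketch of a QFormer-style module with learnable query tokens that attend to image and text features. This is an illustrative simplification, not the actual BLIP-2-style QFormer used in the paper; the class name SimpleQFormer, the dimensions, and the single cross-attention layer are assumptions.

```python
# Minimal sketch of the described QFormer interface (assumed names and sizes).
import torch
import torch.nn as nn

class SimpleQFormer(nn.Module):
    def __init__(self, dim: int = 768, num_query_tokens: int = 8, num_heads: int = 8):
        super().__init__()
        # K learnable query tokens W = [w1, ..., wK]
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, dim) * 0.02)
        # Cross-attention lets the queries attend to the image and text features
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # visual_emb: (B, N_v, dim) -- visual embeddings ξ of the image
        # text_emb:   (B, N_t, dim) -- encoded text description l
        batch = visual_emb.size(0)
        queries = self.query_tokens.expand(batch, -1, -1)    # (B, K, dim)
        context = torch.cat([visual_emb, text_emb], dim=1)   # (B, N_v + N_t, dim)
        attn_out, _ = self.cross_attn(queries, context, context)
        x = self.norm1(queries + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x  # O = [o1, ..., oK], same shape as the query tokens: (B, K, dim)

if __name__ == "__main__":
    model = SimpleQFormer()
    vis = torch.randn(2, 257, 768)   # e.g., patch embeddings from a ViT image encoder
    txt = torch.randn(2, 16, 768)    # e.g., encoded text tokens
    out = model(vis, txt)
    print(out.shape)                 # torch.Size([2, 8, 768])
```

The key point the sketch illustrates is that only the K query tokens are learnable and carried to the output; the image and text features serve purely as cross-attention context, so the output O always keeps the shape of the input query tokens.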