How does the Fuyu model Get images?

#45 opened by VatsaDev

Asking the question above because, from what I'm seeing, you take an image, split it into rows, and give that to the model, and it supposedly has no real difference from Persimmon-8b. How are the images going in? From what I can tell, you're not making image embeddings, so how is the model understanding images?

Hi @VatsaDev , not sure I understand your question exactly, but the model does have a vision layer. It is simply linear, but it does create an embedding vector of the required dimension from each patch. Then, as you said, the embeddings are combined with the text embeddings from the prompt tokens and fed into a Persimmon-8b-like architecture.
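
To make that concrete, here is a minimal sketch of what a purely linear vision layer looks like in PyTorch. This is not Fuyu's actual code, and the patch size and hidden size below are illustrative, not the model's real values:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
patch_size = 30            # side length of a square image patch
num_channels = 3           # RGB
hidden_size = 4096         # hidden size of the language model
patch_dim = num_channels * patch_size * patch_size  # flattened patch length

# The entire "vision layer": one linear projection from a flattened
# patch to an embedding the same size as the text token embeddings.
vision_layer = nn.Linear(patch_dim, hidden_size)

# Fake batch: 1 image split into 12 flattened patches.
patches = torch.randn(1, 12, patch_dim)
patch_embeds = vision_layer(patches)   # shape: (1, 12, hidden_size)

# These patch embeddings are then combined with the text token
# embeddings and fed into the decoder-only transformer.
print(patch_embeds.shape)
```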

I recommend inspecting the modeling code here to get a better sense of what the model is doing: https://github.com/huggingface/transformers/blob/9beb2737d758160e845b66742a0c01201e38007f/src/transformers/models/fuyu/modeling_fuyu.py#L154C1-L158C10

OK, so your vision layer is turning images into embeddings through an nn.Linear class?

Did you really have to train it, or does image-to-embedding just work?

Also, I'm sorry if this is too much, but I'm new to PyTorch and still learning it. Could you give me a code example of image -> embedding -> image?

The linear layer has to be trained.
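
Here is a toy sketch of image -> embedding -> image using two linear layers trained as a tiny autoencoder. To be clear, this is only an illustration of why the linear layer needs training, not Fuyu's setup; Fuyu only has the encoder side, trained together with the rest of the model:

```python
import torch
import torch.nn as nn

patch_dim, embed_dim = 2700, 4096            # illustrative sizes

encoder = nn.Linear(patch_dim, embed_dim)    # image patch -> embedding
decoder = nn.Linear(embed_dim, patch_dim)    # embedding -> image patch

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

patches = torch.randn(8, patch_dim)          # fake batch of flattened patches

for step in range(100):
    embedding = encoder(patches)             # image -> embedding
    recon = decoder(embedding)               # embedding -> image
    loss = nn.functional.mse_loss(recon, patches)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Untrained, nn.Linear is just a random projection, so the embeddings
# carry no useful structure. Training is what makes them meaningful
# to whatever consumes them downstream.
```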
