MRL Truncation + BQ on Vision embeddings?

#5
by bulb-infmind - opened

I understand that nomic-embed-vision-v1.5 is trained to share an embedding space with nomic-embed-text-v1.5 through LiT-like training where, instead of freezing the vision embedder, you freeze the text embedder. This is done so that images can be retrieved.

But I have the following questions:

  1. One important feature of nomic-embed-text-v1.5 was its MRL (Matryoshka Representation Learning) form, where we can truncate the output embedding to a desired smaller size while sacrificing as little quality as possible. Is it also possible to drop the later dimensions of the nomic-embed-vision-v1.5 output and still obtain an embedding with reasonable discriminative ability?

  2. Another important feature of nomic-embed-text-v1.5 was its support for Binary Quantization (BQ). How does this translate to the vision embeddings?

Some experiments also showed that both can be combined (MRL + BQ) for nomic-embed-text-v1.5 while still obtaining good retrieval quality. How does this combination translate to nomic-embed-vision-v1.5? Since both models are essentially treated as sharing the same space, is using both of these features also valid in the vision case?
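For anyone who wants to test the MRL + BQ combination on their own data, here is a minimal sketch of the mechanics: keep the first dimensions, re-normalize, binarize by sign, and rank by Hamming distance. The random arrays are stand-ins for real model outputs, the 768 and 256 sizes are just illustrative choices, and the helper names are hypothetical. Any model-specific preprocessing before truncation is left out, so treat this as a way to run the experiment, not as the recommended pipeline for either model.

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and re-normalize to unit length."""
    cut = emb[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

def binarize(emb: np.ndarray) -> np.ndarray:
    """Sign-based binary quantization: one bit per dimension, packed into bytes."""
    bits = (emb > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_distance(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query row and every packed corpus row."""
    xor = np.bitwise_xor(corpus, query)
    return np.unpackbits(xor, axis=1).sum(axis=1)

# Random stand-ins for real embeddings (assumption: replace with model outputs).
rng = np.random.default_rng(0)
image_emb = rng.standard_normal((5, 768)).astype(np.float32)
text_emb = rng.standard_normal((1, 768)).astype(np.float32)

# Step 1: MRL-style truncation to the first 256 dimensions.
img_small = truncate_and_normalize(image_emb, 256)
txt_small = truncate_and_normalize(text_emb, 256)

# Step 2: binary quantization, 256 bits = 32 bytes per vector.
img_bits = binarize(img_small)
txt_bits = binarize(txt_small)

# Step 3: rank images by Hamming distance to the text query (smaller is closer).
distances = hamming_distance(txt_bits, img_bits)
ranking = np.argsort(distances)
```

Comparing `ranking` against the ranking from full-precision, full-length cosine similarity on a labeled set is one way to measure how much retrieval quality is lost.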

Nomic AI org

Sorry for such a delay on this, it slipped my mind! We were hopeful that these properties would transfer, but in our initial tests we saw poor results with both Matryoshka and BQ on the vision embeddings. There are definitely better ways to enforce this and more time could be spent on it, but unfortunately we didn't get the time.

Hi @zpn,
Not at all; even the ability to simply embed the visual modality in a similar vector space as the textual embeddings is great, because it allows users of the former to upgrade to this space without recomputing all their previous textual embeddings.

Maybe this has to do with the fact that LiT-style training only minimizes a contrastive loss between the embeddings produced by the two embedders. That is great for similarity search, but it does not mathematically guarantee that both embedders project their inputs into exactly the same output vector space (which is what would guarantee that the above properties transfer).
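A toy illustration of that point, with made-up 4-dimensional vectors and not actual model outputs: two vectors can be very close under full-length cosine similarity while their leading dimensions are completely unrelated, so good full-dimension alignment does not by itself imply that truncated prefixes line up.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors: they agree on the last two dimensions only.
text_vec = np.array([1.0, 0.0, 3.0, 3.0])
image_vec = np.array([0.0, 1.0, 3.0, 3.0])

# Full-length similarity is high (18/19, about 0.947).
full_sim = cosine(text_vec, image_vec)

# Similarity on the first two dimensions alone is zero.
prefix_sim = cosine(text_vec[:2], image_vec[:2])
```

A loss computed only on the full-length vectors would be satisfied by this pair, which is one plausible reason truncation could behave worse on the vision side.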

I wonder if this can be done. That would be an interesting research paper.
In any case, thank you for the reply. Appreciate it.

bulb-infmind changed discussion status to closed
