Embedding creation

#6
by C-Stuti - opened

How are you creating the embeddings here?

Sentence Transformers org

The query embeddings are created on the fly here, and the corpus embeddings are created with this script for binary and this script for int8. Both scripts start from float embeddings that were already computed by calling the encode method of the Sentence Transformer model: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1#sentence-transformers

  • Tom Aarsen
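
To make the binary and int8 steps concrete, here is a minimal NumPy sketch of the two quantization schemes. It is an illustration under assumptions, not the code from the linked scripts: the random matrix stands in for the float embeddings returned by the model's encode method, 1024 is assumed as the embedding width, and the helper names quantize_binary and quantize_int8 are hypothetical.

```python
import numpy as np


def quantize_binary(embeddings: np.ndarray) -> np.ndarray:
    """Keep one bit per dimension (1 if the value is positive) and pack 8 bits per byte."""
    return np.packbits(embeddings > 0, axis=-1)


def quantize_int8(embeddings: np.ndarray, calibration: np.ndarray) -> np.ndarray:
    """Linearly map each dimension from its calibration [min, max] range onto [-128, 127]."""
    lo = calibration.min(axis=0)
    hi = calibration.max(axis=0)
    # Guard against constant dimensions to avoid dividing by zero.
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    quantized = np.round((embeddings - lo) / scale) - 128
    return np.clip(quantized, -128, 127).astype(np.int8)


# Stand-in for float32 embeddings produced by model.encode(...) (assumption).
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 1024)).astype(np.float32)

binary_corpus = quantize_binary(corpus)       # 1024 bits -> 128 bytes per vector
int8_corpus = quantize_int8(corpus, corpus)   # 1 byte per dimension

print(binary_corpus.shape, binary_corpus.dtype)
print(int8_corpus.shape, int8_corpus.dtype)
```

Sentence Transformers also ships a quantize_embeddings helper in sentence_transformers.quantization that covers these precisions, so in practice the hand-written helpers above would likely not be needed.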

The above scripts don't show how the IVF binary indexes are created, which are used for approximate search in the demo. Could you please share that as well?

Sentence Transformers org

Apologies, I missed your message. The indices are created with these scripts:
https://huggingface.co/spaces/sentence-transformers/quantized-retrieval/blob/main/save_binary_index.py
https://huggingface.co/spaces/sentence-transformers/quantized-retrieval/blob/main/save_int8_index.py

The demo uses the same scripts, except with "50m" instead of "1m" and "mixedbread-ai/wikipedia-embed-en-2023-11" instead of "mixedbread-ai/wikipedia-2023-11-embed-en-pre-1". (Note that "50m" is a bit of a misnomer: I was under the impression that there were 50m embeddings, but there are 41m.)

  • Tom Aarsen
