|
--- |
|
license: mit |
|
language: |
|
- en |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- bert |
|
- mteb |
|
- bert.cpp |
|
- ggml |
|
--- |
|
|
|
# Model details |
|
|
|
This repository contains the weights of [intfloat/e5-small-v2](https://huggingface.co/intfloat/e5-small-v2) converted to **GGML** format for use with the [bert.cpp backend](https://github.com/skeskinen/bert.cpp).
|
|
|
> - [Text Embeddings by Weakly-Supervised Contrastive Pre-training](https://arxiv.org/pdf/2212.03533.pdf). |
|
> - Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022 |
|
> - This model has 12 layers and the embedding size is 384. |
|
|
|
|
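To double-check those numbers against the original checkpoint, a minimal sketch using `transformers` (it inspects the upstream `intfloat/e5-small-v2` repository, not the GGML files here) could look like this:

```python
# Minimal sketch: read the architecture of the original PyTorch checkpoint.
# Requires the `transformers` package and access to the Hugging Face Hub.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("intfloat/e5-small-v2")
print(config.num_hidden_layers)  # expected: 12
print(config.hidden_size)        # expected: 384 (the embedding size)
```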
|
--- |
|
|
|
## FAQ |
|
|
|
**1. Do I need to add the prefix "query: " and "passage: " to input texts?** |
|
|
|
Yes. The model was trained with these prefixes, and omitting them will degrade performance.
|
|
|
Here are some rules of thumb: |
|
- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval. |
|
|
|
- Use "query: " prefix for symmetric tasks such as semantic similarity, paraphrase retrieval. |
|
|
|
- Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering. |
|
|
|
**2. Why are my reproduced results slightly different from those reported in the model card?**
|
|
|
Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences. |
|
|
|
**3. Why do the cosine similarity scores mostly fall between 0.7 and 1.0?**
|
|
|
This is known and expected behavior, since the model was trained with a low temperature of 0.01 for the InfoNCE contrastive loss.
|
|
|
For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.
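
As a purely illustrative sketch (random stand-in vectors, not real model outputs): ranking candidates by cosine similarity is unaffected by the scores being compressed into a narrow high range.

```python
# Sketch: retrieval only uses the relative order of cosine similarities.
import torch
import torch.nn.functional as F

query = F.normalize(torch.randn(1, 384), dim=1)      # stand-in for a "query: ..." embedding
passages = F.normalize(torch.randn(8, 384), dim=1)   # stand-ins for "passage: ..." embeddings

scores = (query @ passages.T).squeeze(0)             # cosine similarities (unit-norm vectors)
ranking = scores.argsort(descending=True)            # this ordering is all that retrieval needs
print(ranking.tolist())
```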
|
|
|
## Citation |
|
|
|
If you find our paper or models helpful, please consider citing as follows:
|
|
|
``` |
|
@article{wang2022text, |
|
title={Text Embeddings by Weakly-Supervised Contrastive Pre-training}, |
|
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu}, |
|
journal={arXiv preprint arXiv:2212.03533}, |
|
year={2022} |
|
} |
|
``` |
|
|
|
## Limitations |
|
|
|
This model only works for English texts. Long texts will be truncated to at most 512 tokens. |
|
|