MTEB reproduction
Hi! I wanted to add your model to MTEB so that everyone can easily run it using the platform. I used the following prompts for your model (full code here):
```python
STELLA_S2S_PROMPT = "Instruct: Retrieve semantically similar text.\nQuery: "
STELLA_S2P_PROMPT = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "

STELLA_PROMPTS = {
    "query": STELLA_S2P_PROMPT,
    "passage": "",
    "STS": STELLA_S2S_PROMPT,
    "PairClassification": STELLA_S2S_PROMPT,
    "BitextMining": STELLA_S2S_PROMPT,
}
```
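For context, this is roughly how I apply these prompts during encoding, using SentenceTransformer's `prompt` argument (a minimal sketch, the exact wiring in my wrapper may differ):

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is needed for the custom stella modules
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

# Retrieval: queries get the s2p instruction, passages are encoded without a prompt.
query_embeddings = model.encode(
    ["What is the capital of France?"],
    prompt=STELLA_S2P_PROMPT,
)
passage_embeddings = model.encode(
    ["Paris is the capital and largest city of France."],
)

# STS / PairClassification / BitextMining: both sides get the s2s instruction.
sts_embeddings = model.encode(
    ["A man is playing a guitar.", "Someone plays guitar."],
    prompt=STELLA_S2S_PROMPT,
)
```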
I've obtained similar results for Pair Classification and STS tasks, but the overall scores don't fully match yours. Could you share more details on how your implementation was set up for MTEB?
Here are the full results of my run.
For some tasks the difference is large. For example, AmazonCounterfactualClassification (en) is 92.36 on the leaderboard, but I got 72.59. Is it possible to load the 8192-dimensional embeddings using SentenceTransformer? I can't find this in the readme or in the Sentence Transformers docs.
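My best guess, assuming the repo ships `2_Dense_{dim}` projection folders like the larger stella releases (that is an assumption on my side, as are the weight key names and the mean pooling below), is to apply the projection manually on top of transformers rather than through SentenceTransformer, roughly like this:

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

model_id = "dunzhang/stella_en_400M_v5"
vector_dim = 8192  # assumed: one 2_Dense_{dim} folder per output dimension

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

# Load the assumed 8192-d linear projection that sits on top of the pooled hidden state.
dense_path = hf_hub_download(model_id, f"2_Dense_{vector_dim}/pytorch_model.bin")
state_dict = torch.load(dense_path, map_location="cpu")
vector_linear = torch.nn.Linear(model.config.hidden_size, vector_dim)
# Key names ("linear.weight" / "linear.bias") are assumed from other stella model cards.
vector_linear.load_state_dict({k.replace("linear.", ""): v for k, v in state_dict.items()})

with torch.no_grad():
    inputs = tokenizer(["example sentence"], padding=True, truncation=True, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling (my assumption)
    embeddings = torch.nn.functional.normalize(vector_linear(pooled), p=2, dim=-1)
```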
I opened a PR in the mteb repo with my implementation and results. Here is the full comparison of the results.
Classification
model name | AmazonCounterfactualClassification (en) | EmotionClassification | ToxicConversationsClassification |
---|---|---|---|
stella_en_400M_v5 (leaderboard) | 92.36 | 78.77 | 89.94 |
stella_en_400M_v5 | 72.59 | 56.48 | 66.11 |
Clustering
model name | ArxivClusteringS2S | RedditClustering |
---|---|---|
stella_en_400M_v5 (leaderboard) | 49.82 | 71.19 |
stella_en_400M_v5 | 45.54 | 60.75 |
PairClassification
model name | SprintDuplicateQuestions | TwitterSemEval2015 |
---|---|---|
stella_en_400M_v5 (leaderboard) | 95.59 | 80.18 |
stella_en_400M_v5 | 94.44 | 80.26 |
Reranking
model name | SciDocsRR | AskUbuntuDupQuestions |
---|---|---|
stella_en_400M_v5 (leaderboard) | 88.44 | 66.15 |
stella_en_400M_v5 | 86.40 | 62.90 |
Retrieval
model name | SCIDOCS | SciFact |
---|---|---|
stella_en_400M_v5 (leaderboard) | 25.04 | 78.23 |
stella_en_400M_v5 | 23.96 | 77.96 |
STS
model name | STS16 | STSBenchmark |
---|---|---|
stella_en_400M_v5 (leaderboard) | 87.14 | 87.74 |
stella_en_400M_v5 | 87.00 | 87.56 |
Summarization
model name | SummEval |
---|---|
stella_en_400M_v5 (leaderboard) | 31.66 |
stella_en_400M_v5 | 30.59 |
@infgrad Can you provide details on how you evaluated the model on MTEB?
I tried to run your model with gte_loader
```python
stella_en_400M = ModelMeta(
    loader=partial(
        gte_loader,
        model_name_or_path="dunzhang/stella_en_400M_v5",
        attn="cccc",
        pooling_method="lasttoken",
        mode="embedding",
        torch_dtype="auto",
        # The ST script does not normalize while the HF one does, so it is unclear what to do:
        # https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct#sentence-transformers
        normalized=True,
    ),
    name="dunzhang/stella_en_400M_v5",
    languages=["eng_Latn"],
    open_source=True,
    revision="1bb50bc7bb726810eac2140e62155b88b0df198f",
    release_date="2024-07-12",
)
```
and it gives better results, but they are still lower than the ones you reported.
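For completeness, this is roughly how I run the evaluation once the model is defined (a sketch against the current mteb API; the task selection here is just an example subset of the tables above):

```python
import mteb

# Instantiate the model through the partial defined in the ModelMeta above.
model = stella_en_400M.loader()

# Run a small subset of the tasks from the comparison tables.
tasks = mteb.get_tasks(tasks=["STS16", "SciFact", "SprintDuplicateQuestions"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/stella_en_400M_v5")
```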
Hi,
@Samoed
Try these settings (there is a short sketch after the list):
- max_len = 400
- do not normalize vectors for the Classification task
- use the e5-mistral prompts; the stella model's evaluation setup is the same as e5-mistral or gte-qwen2
- run inference with bf16, e.g. `load_dtype = torch.bfloat16`
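Roughly, in SentenceTransformers terms, those settings look like this (just a sketch for illustration, not my original evaluation script; the instruction shown is only the retrieval one):

```python
import torch
from sentence_transformers import SentenceTransformer

# bf16 inference and 400-token truncation, as in the list above.
model = SentenceTransformer(
    "dunzhang/stella_en_400M_v5",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)
model.max_seq_length = 400

# e5-mistral-style instruction; normalize everywhere except Classification tasks.
query_prompt = (
    "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
)
is_classification_task = False  # set per task
embeddings = model.encode(
    ["what is the capital of france"],
    prompt=query_prompt,
    normalize_embeddings=not is_classification_task,
)
```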
It's been so long that I can't remember the details.
However, I have recently been working on a multimodal encoder; as part of that work I will have to reproduce stella's results, and I will upload the evaluation scripts.
Finally, if you still cannot reproduce the results, you can wait a while.
Thank you!