MTEB reproduction
Hi! I wanted to add your model to MTEB so that everyone can easily run it using the platform. I used the following prompts for your model (full code here):
```python
STELLA_S2S_PROMPT = "Instruct: Retrieve semantically similar text.\nQuery: "
STELLA_S2P_PROMPT = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "

STELLA_PROMPTS = {
    "query": STELLA_S2P_PROMPT,
    "passage": "",
    "STS": STELLA_S2S_PROMPT,
    "PairClassification": STELLA_S2S_PROMPT,
    "BitextMining": STELLA_S2S_PROMPT,
}
```
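For context, this is roughly how I apply these prompts during encoding, using SentenceTransformer's `prompt` argument (a minimal sketch, the exact wiring in my wrapper may differ):

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is needed for the custom stella modules
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

# Retrieval: queries get the s2p instruction, passages are encoded without a prompt.
query_embeddings = model.encode(
    ["What is the capital of France?"],
    prompt=STELLA_S2P_PROMPT,
)
passage_embeddings = model.encode(
    ["Paris is the capital and largest city of France."],
)

# STS / PairClassification / BitextMining: both sides get the s2s instruction.
sts_embeddings = model.encode(
    ["A man is playing a guitar.", "Someone plays guitar."],
    prompt=STELLA_S2S_PROMPT,
)
```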
I've obtained similar results for Pair Classification and STS tasks, but the overall scores don't fully match yours. Could you share more details on how your implementation was set up for MTEB?
Here are the full results of my run.
For some tasks the difference is large. For example, AmazonCounterfactualClassification (en) is 92.36 on the leaderboard, but I got 72.59. Is it possible to load the 8192-dimensional embeddings using SentenceTransformer? I can't find this in the readme or in the Sentence Transformers docs.
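My best guess, assuming the repo ships `2_Dense_{dim}` projection folders like the larger stella releases (that is an assumption on my side, as are the weight key names and the mean pooling below), is to apply the projection manually on top of transformers rather than through SentenceTransformer, roughly like this:

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

model_id = "dunzhang/stella_en_400M_v5"
vector_dim = 8192  # assumed: one 2_Dense_{dim} folder per output dimension

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

# Load the assumed 8192-d linear projection that sits on top of the pooled hidden state.
dense_path = hf_hub_download(model_id, f"2_Dense_{vector_dim}/pytorch_model.bin")
state_dict = torch.load(dense_path, map_location="cpu")
vector_linear = torch.nn.Linear(model.config.hidden_size, vector_dim)
# Key names ("linear.weight" / "linear.bias") are assumed from other stella model cards.
vector_linear.load_state_dict({k.replace("linear.", ""): v for k, v in state_dict.items()})

with torch.no_grad():
    inputs = tokenizer(["example sentence"], padding=True, truncation=True, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling (my assumption)
    embeddings = torch.nn.functional.normalize(vector_linear(pooled), p=2, dim=-1)
```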
I opened a PR in the mteb repo with my implementation and results. Here is the full comparison of the results.
Classification
model name | AmazonCounterfactualClassification (en) | EmotionClassification | ToxicConversationsClassification |
---|---|---|---|
stella_en_400M_v5 (leaderboard) | 92.36 | 78.77 | 89.94 |
stella_en_400M_v5 | 72.59 | 56.48 | 66.11 |
Clustering
model name | ArxivClusteringS2S | RedditClustering |
---|---|---|
stella_en_400M_v5 (leaderboard) | 49.82 | 71.19 |
stella_en_400M_v5 | 45.54 | 60.75 |
PairClassification
model name | SprintDuplicateQuestions | TwitterSemEval2015 |
---|---|---|
stella_en_400M_v5 (leaderboard) | 95.59 | 80.18 |
stella_en_400M_v5 | 94.44 | 80.26 |
Reranking
model name | SciDocsRR | AskUbuntuDupQuestions |
---|---|---|
stella_en_400M_v5 (leaderboard) | 88.44 | 66.15 |
stella_en_400M_v5 | 86.40 | 62.90 |
Retrieval
model name | SCIDOCS | SciFact |
---|---|---|
stella_en_400M_v5 (leaderboard) | 25.04 | 78.23 |
stella_en_400M_v5 | 23.96 | 77.96 |
STS
model name | STS16 | STSBenchmark |
---|---|---|
stella_en_400M_v5 (leaderboard) | 87.14 | 87.74 |
stella_en_400M_v5 | 87.00 | 87.56 |
Summarization
model name | SummEval |
---|---|
stella_en_400M_v5 (leaderboard) | 31.66 |
stella_en_400M_v5 | 30.59 |
@infgrad Can you provide details on how you evaluated the model on MTEB?
I tried to run your model with gte_loader
```python
stella_en_400M = ModelMeta(
    loader=partial(
        gte_loader,
        model_name_or_path="dunzhang/stella_en_400M_v5",
        attn="cccc",
        pooling_method="lasttoken",
        mode="embedding",
        torch_dtype="auto",
        # The ST script does not normalize while the HF one does, so it is unclear what to do:
        # https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct#sentence-transformers
        normalized=True,
    ),
    name="dunzhang/stella_en_400M_v5",
    languages=["eng_Latn"],
    open_source=True,
    revision="1bb50bc7bb726810eac2140e62155b88b0df198f",
    release_date="2024-07-12",
)
```
and it gives better results, but they are still lower than the ones you reported.
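For completeness, this is roughly how I run the evaluation once the model is defined (a sketch against the current mteb API; the task selection here is just an example subset of the tables above):

```python
import mteb

# Instantiate the model through the partial defined in the ModelMeta above.
model = stella_en_400M.loader()

# Run a small subset of the tasks from the comparison tables.
tasks = mteb.get_tasks(tasks=["STS16", "SciFact", "SprintDuplicateQuestions"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/stella_en_400M_v5")
```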
Hi,
@Samoed
Try these settings (there is a short sketch after the list):
- max_len = 400
- do not normalize vectors for the Classification task
- use the e5-mistral prompts; the stella model's evaluation setup is the same as e5-mistral or gte-qwen2
- run inference with bf16, e.g. `load_dtype = torch.bfloat16`
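Roughly, in SentenceTransformers terms, those settings look like this (just a sketch for illustration, not my original evaluation script; the instruction shown is only the retrieval one):

```python
import torch
from sentence_transformers import SentenceTransformer

# bf16 inference and 400-token truncation, as in the list above.
model = SentenceTransformer(
    "dunzhang/stella_en_400M_v5",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)
model.max_seq_length = 400

# e5-mistral-style instruction; normalize everywhere except Classification tasks.
query_prompt = (
    "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
)
is_classification_task = False  # set per task
embeddings = model.encode(
    ["what is the capital of france"],
    prompt=query_prompt,
    normalize_embeddings=not is_classification_task,
)
```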
It's been so long that I can't remember the details.
However, I have recently been working on a multimodal encoder; as part of that work I will have to reproduce stella's results, and I will upload the evaluation scripts.
Finally, if you still cannot reproduce the results, you can wait a while.
Thank you!