Some questions about reproducing this MTEB-result

#3
by infgrad - opened

Hi, I cannot reproduce this MTEB result. Could you please re-check the model weights?
Also, what pooling does this model use? In your transformers code example the pooling is last-token, but in your sentence-transformers code the pooling is mean (is the 1_Pooling folder missing? The default pooling in sentence-transformers is mean).

infgrad changed discussion title from Cannot reproduce this MTEB-result to Some questions about reproducing this MTEB-result

Hi infgrad,

Thank you for bringing this to our attention. We have carefully reviewed your concerns and retested the model weights. We can confirm that these weights do indeed reproduce the MTEB results as expected.

Regarding the pooling method, we originally used last-token pooling. However, thanks to Tom's assistance, we have corrected the discrepancy in the sentence transformer implementation. The issue has been resolved, and the correct pooling method is now consistently applied.

Could you please retry using the last-token pooling and let us know the results you achieve? We are eager to further discuss and ensure the reproducibility of our model.

Best,
Ye
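For readers following the pooling question, here is a minimal sketch of how mean pooling and last-token pooling differ on a single right-padded sequence. It uses plain Python lists instead of tensors, and the function names and toy values are illustrative assumptions, not code from the model repository.

```python
from typing import List

Vector = List[float]


def mean_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Average the hidden states of the real (mask == 1) tokens."""
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


def last_token_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Return the hidden state of the last real token (assumes right padding)."""
    last_index = sum(mask) - 1
    return hidden[last_index]


# Toy sequence: three real tokens followed by one padding position.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

mean_vec = mean_pool(hidden_states, attention_mask)
last_vec = last_token_pool(hidden_states, attention_mask)
print(mean_vec, last_vec)
```

The two strategies return different vectors for the same input, so a pipeline that silently falls back to mean pooling would be expected to score differently from one that uses last-token pooling.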

@yliu279 I cannot reproduce the results with the sentence-transformers code (using the latest version). When running the BEIR-FiQA eval on the curated, sentence-transformers-based pipeline, I get recall@10=17.111, where it should be 69.440. It sounds like the pooling might be wrong somehow. Could you confirm that padding and accessing the last token work the same way across the transformers and sentence-transformers pipelines?
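To make the padding concern concrete, the sketch below shows one way the index of the last real token can be chosen so that it works for both left-padded and right-padded batches. The helper name and the padding-side heuristic are assumptions for illustration; neither pipeline is confirmed in this thread to work exactly this way.

```python
from typing import List


def last_token_indices(attention_mask: List[List[int]]) -> List[int]:
    """Index of the last real token in each row of a padded batch."""
    # If every row ends in a real token, treat the batch as left-padded:
    # the last position is then the last real token for all rows.
    left_padded = all(row[-1] == 1 for row in attention_mask)
    if left_padded:
        return [len(row) - 1 for row in attention_mask]
    # Otherwise assume right padding: count the real tokens in each row.
    return [sum(row) - 1 for row in attention_mask]


right_padded_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
left_padded_mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]

right_indices = last_token_indices(right_padded_mask)
left_indices = last_token_indices(left_padded_mask)
print(right_indices, left_indices)
```

If a pipeline always took position -1 on a right-padded batch, it would read a padding position for the shorter rows, which is one way the two pipelines could diverge.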

@yliu279 @mbien
Hi, I carefully reviewed my test code and still get the same results. Here is a minimal reproduction:


import functools
import os
from mteb import MTEB
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    # load model
    model = SentenceTransformer("/mnt/hwdata/ip/nlp/public_models/SFR-Embedding-2_R", device="cuda")
    model.encode = functools.partial(
        model.encode,
        batch_size=8,
        show_progress_bar=True,
        prompt="Instruct: Retrieve semantically similar text.\nQuery: "  # only test STS
    )

    evaluation = MTEB(tasks=["STSBenchmark"], task_langs=["en"])
    evaluation.run(
        model,
        output_folder="sts_results",
        eval_splits=["test"],
        verbosity=2,
        overwrite_results=True,
    )

The result is:

{
  "dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
  "evaluation_time": 84.45004653930664,
  "kg_co2_emissions": null,
  "mteb_version": "1.12.48",
  "scores": {
    "test": [
      {
        "cosine_pearson": 0.701287466112608,
        "cosine_spearman": 0.7236247747370012,
        "euclidean_pearson": 0.7204492422443474,
        "euclidean_spearman": 0.7233661781589509,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.7236247747370012,
        "manhattan_pearson": 0.7397991758538812,
        "manhattan_spearman": 0.7408110026601888,
        "pearson": [
          0.7012874418016998,
          1.2264194202428853e-204
        ],
        "spearman": [
          0.7236247747370012,
          5.439385473263533e-224
        ]
      }
    ]
  },
  "task_name": "STSBenchmark"
}

However, in your README.md, the result is:

  - task:
      type: STS
    dataset:
      type: mteb/stsbenchmark-sts
      name: MTEB STSBenchmark
      config: default
      split: test
      revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
    metrics:
    - type: cos_sim_pearson
      value: 83.55433725920493
    - type: cos_sim_spearman
      value: 83.60373857254014
    - type: euclidean_pearson
      value: 83.08086082334839
    - type: euclidean_spearman
      value: 83.6036864776559
    - type: manhattan_pearson
      value: 83.2232267589246
    - type: manhattan_spearman
      value: 83.78923946962664
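Note that the two listings use different scales and metric names: the MTEB JSON output is on a 0-1 scale with "cosine_*" keys, while the README metadata is on a 0-100 scale with "cos_sim_*" keys. A small sketch that puts both on the same scale and prints the gap per metric, using only the numbers quoted above (the helper name is made up for this example):

```python
# Scores from the MTEB JSON output above (0-1 scale).
measured = {
    "cosine_pearson": 0.701287466112608,
    "cosine_spearman": 0.7236247747370012,
    "euclidean_pearson": 0.7204492422443474,
    "euclidean_spearman": 0.7233661781589509,
    "manhattan_pearson": 0.7397991758538812,
    "manhattan_spearman": 0.7408110026601888,
}

# Scores from the README metadata above (0-100 scale).
readme = {
    "cos_sim_pearson": 83.55433725920493,
    "cos_sim_spearman": 83.60373857254014,
    "euclidean_pearson": 83.08086082334839,
    "euclidean_spearman": 83.6036864776559,
    "manhattan_pearson": 83.2232267589246,
    "manhattan_spearman": 83.78923946962664,
}


def readme_key(name: str) -> str:
    """Map a JSON metric name to the name used in the README metadata."""
    return name.replace("cosine_", "cos_sim_")


# Gap in percentage points between the README value and the measured value.
gaps = {
    name: readme[readme_key(name)] - 100.0 * value
    for name, value in measured.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below the README value")
```

Every metric falls short by roughly 9 to 13 points, which is far too large to be run-to-run noise.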


@yliu279 could you give us some hints on reproducing the scores?

Hi @infgrad @mbien

We noticed a discrepancy in the Sentence Transformer Evaluation. We are currently working on resolving this issue and will share the solution shortly. In the meantime, here is the process we use to produce the results. Please feel free to try it if you are interested:
Use the E5 evaluation pipeline: https://github.com/microsoft/unilm/blob/master/e5/mteb_except_retrieval_eval.py
First, make two edits in utils.py:

  1. Add 'SFR-Embedding-2_R': 'instruction' to the MODEL_NAME_TO_PREFIX_TYPE dict and 'SFR-Embedding-2_R': 'last' to the MODEL_NAME_TO_POOL_TYPE dict.
  2. Revise the create_batch_dict() function as follows:
        batch_dict = tokenizer(
            input_texts,
            max_length=max_length - 1,
            return_token_type_ids=False,
            return_attention_mask=False,
            padding=False,
            truncation=True
        )

        # append eos_token_id to every input_ids
        batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]

        return tokenizer.pad(
            batch_dict,
            padding=True,
            pad_to_multiple_of=8,
            return_attention_mask=True,
            return_tensors="pt",
        )

Second, in the main() function of e5_mteb_except_retrieval_eval.py:

model = DenseEncoder()
evaluation = MTEB(tasks=["STSBenchmark"], task_langs=["en"])
evaluation.run(
    model,
    output_folder="sts_results",
    eval_splits=["test"],
    verbosity=2,
    overwrite_results=True,
)

You will get results as below:

{
  "dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
  "evaluation_time": 14.305434942245483,
  "kg_co2_emissions": null,
  "mteb_version": "1.12.48",
  "scores": {
    "test": [
      {
        "cosine_pearson": 0.8355240450842275,
        "cosine_spearman": 0.8360701599480195,
        "euclidean_pearson": 0.8307927408782112,
        "euclidean_spearman": 0.8360703731734451,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.8360701599480195,
        "manhattan_pearson": 0.832215434631109,
        "manhattan_spearman": 0.8378697003913586,
        "pearson": 0.8355240450842275,
        "spearman": 0.8360701599480195
      }
    ]
  },
  "task_name": "STSBenchmark"
}
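The key step in the revised create_batch_dict() is appending the EOS token to every sequence before padding, so that last-token pooling reads the hidden state at the EOS position. Below is a plain-Python sketch of that step. It does not call the tokenizer; the function name, the token ids, and the right-padding choice are assumptions for illustration.

```python
from typing import List, Tuple


def append_eos_and_pad(
    batch: List[List[int]], eos_id: int, pad_id: int, multiple: int = 8
) -> Tuple[List[List[int]], List[List[int]]]:
    """Append EOS to every sequence, then right-pad to a multiple of `multiple`."""
    with_eos = [ids + [eos_id] for ids in batch]
    longest = max(len(ids) for ids in with_eos)
    # Round the longest length up to the next multiple (ceiling division).
    target = -(-longest // multiple) * multiple
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in with_eos]
    attention_mask = [
        [1] * len(ids) + [0] * (target - len(ids)) for ids in with_eos
    ]
    return input_ids, attention_mask


# Hypothetical token ids: 2 stands in for EOS, 0 for the padding token.
input_ids, attention_mask = append_eos_and_pad(
    [[5, 6, 7], [8, 9]], eos_id=2, pad_id=0
)

# The last real token of every row is the EOS token.
last_real_tokens = [
    row[sum(mask) - 1] for row, mask in zip(input_ids, attention_mask)
]
print(input_ids)
print(last_real_tokens)
```

Without the EOS step, the last real token would be the final content token instead, so the pooled position would differ from the one used in this pipeline.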

Hi @infgrad @mbien ,

The Sentence Transformer evaluation is now functioning correctly. We have added "add_eos_token": true to tokenizer_config.json. You can now obtain accurate results using the ST evaluation:

  "scores": {
    "test": [
      {
        "cosine_pearson": 0.8355526890934296,
        "cosine_spearman": 0.8360173852997346,
        "euclidean_pearson": 0.830706240702224,
        "euclidean_spearman": 0.8365412824235895,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.8360173852997346,
        "manhattan_pearson": 0.8318737804127988,
        "manhattan_spearman": 0.8380955443197002,
        "pearson": [
          0.8355526691849025,
          0.0
        ],
        "spearman": [
          0.8360186564578723,
          0.0
        ]
      }
    ]
  },
  "task_name": "STSBenchmark"

Great!
I can now obtain accurate results on all MTEB tasks.

infgrad changed discussion status to closed
