metadata
pipeline_tag: sentence-similarity
language: fr
datasets:
  - stsb_multi_mt
tags:
  - Text
  - Sentence Similarity
  - Sentence-Embedding
  - camembert-base
license: apache-2.0
model-index:
  - name: sentence-camembert-base by Van Tuan DANG
    results:
      - task:
          name: Sentence-Embedding
          type: Text Similarity
        dataset:
          name: Text Similarity fr
          type: stsb_multi_mt
          args: fr
        metrics:
          - name: Test Pearson correlation coefficient
            type: Pearson_correlation_coefficient
            value: 86.88

Pre-trained sentence embedding models are the state of the art for sentence embeddings in French.

This model improves on dangvantuan/sentence-camembert-base. It was fine-tuned with Augmented SBERT on the stsb dataset, using pair sampling strategies that rely on two models: CrossEncoder-camembert-large and dangvantuan/sentence-camembert-large.
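
To make the Augmented SBERT idea concrete, here is a minimal sketch of its sampling and labeling step: candidate sentence pairs are picked with a retriever-style similarity, then scored by a stronger labeler to produce "silver" training pairs. This is not the author's training script. The helper names (`jaccard`, `sample_pairs`, `build_silver_set`) are hypothetical, and a toy word-overlap score stands in for both the bi-encoder used for sampling and the cross-encoder used for labeling.

```python
def jaccard(a, b):
    """Toy stand-in scorer (assumption): word-overlap similarity in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def sample_pairs(sentences, retriever_score, top_k=1):
    """Pair sampling: for each sentence, keep its top_k most similar partners."""
    pairs = set()
    for i, sentence in enumerate(sentences):
        candidates = [j for j in range(len(sentences)) if j != i]
        ranked = sorted(
            candidates,
            key=lambda j: retriever_score(sentence, sentences[j]),
            reverse=True,
        )
        for j in ranked[:top_k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair once
    return sorted(pairs)


def build_silver_set(sentences, retriever_score, labeler_score, top_k=1):
    """Score every sampled pair with the labeler to get silver training data."""
    return [
        (sentences[i], sentences[j], labeler_score(sentences[i], sentences[j]))
        for i, j in sample_pairs(sentences, retriever_score, top_k)
    ]


unlabeled = [
    "Un homme joue de la guitare.",
    "Un homme joue de la flûte.",
    "Un chat dort sur le canapé.",
]
# In a real setup the two scorers would be a bi-encoder and a cross-encoder.
silver = build_silver_set(unlabeled, retriever_score=jaccard, labeler_score=jaccard)
for first, second, label in silver:
    print(f"{label:.2f}  {first} | {second}")
```

The silver pairs would then be added to the gold stsb pairs before fine-tuning the bi-encoder; that training step is omitted here.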

Usage

The model can be used directly (without a language model) as follows:

from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("Lajavaness/sentence-camembert-base")

sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

# encode() returns one embedding vector per input sentence
embeddings = model.encode(sentences)
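
Since the model targets sentence similarity, embeddings are typically compared with cosine similarity. The sketch below shows that comparison in plain Python. The helper names `cosine_similarity` and `similarity_matrix` are illustrative, and the short `toy_embeddings` vectors are made-up stand-ins for the rows of `embeddings`, so the example runs without downloading the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional vectors (assumption); real embeddings are much longer.
toy_embeddings = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
]
scores = similarity_matrix(toy_embeddings)
for row in scores:
    print(["%.3f" % value for value in row])
```

Passing the rows of `embeddings` instead of `toy_embeddings` should give the pairwise similarity of the French sentences above.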

Evaluation

The model can be evaluated on the French dev and test splits of stsb as follows.

from sentence_transformers import SentenceTransformer
from sentence_transformers.readers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

# Load the model to evaluate
model = SentenceTransformer("Lajavaness/sentence-camembert-base")

def convert_dataset(dataset):
    """Convert stsb_multi_mt rows into InputExample pairs with labels in [0, 1]."""
    dataset_samples = []
    for row in dataset:
        score = float(row['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation

# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")

Test Result: The performance is measured using Pearson and Spearman correlation on the sts-benchmark:

  • On dev

| Model | Pearson correlation | Spearman correlation | #params |
| --- | --- | --- | --- |
| Lajavaness/sentence-camembert-base | 86.88 | 86.73 | 110M |
| dangvantuan/sentence-camembert-base | 86.73 | 86.54 | 110M |
| inokufu/flaubert-base-uncased-xnli-sts | 85.85 | 85.71 | 137M |
| distiluse-base-multilingual-cased | 79.22 | 79.16 | 135M |

  • On test: Pearson and Spearman correlations are evaluated on several benchmark datasets:

Pearson score

| Model | STS-B | STS12-fr | STS13-fr | STS14-fr | STS15-fr | STS16-fr | SICK-fr | params |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lajavaness/sentence-camembert-base | 83.46 | 84.49 | 84.61 | 83.94 | 86.94 | 75.20 | 82.86 | 110M |
| inokufu/flaubert-base-uncased-xnli-sts | 82.82 | 84.79 | 85.76 | 82.81 | 85.38 | 74.05 | 82.23 | 137M |
| dangvantuan/sentence-camembert-base | 82.36 | 82.06 | 84.08 | 81.51 | 85.54 | 73.97 | 80.91 | 110M |
| sentence-transformers/distiluse-base-multilingual-cased-v2 | 78.63 | 72.51 | 67.25 | 70.12 | 79.93 | 66.67 | 77.76 | 135M |
| hugorosen/flaubert_base_uncased-xnli-sts | 78.38 | 79.00 | 77.61 | 76.56 | 79.03 | 71.22 | 80.58 | 137M |
| antoinelouis/biencoder-camembert-base-mmarcoFR | 76.97 | 71.43 | 73.50 | 70.56 | 78.44 | 71.23 | 77.62 | 110M |

Spearman score

| Model | STS-B | STS12-fr | STS13-fr | STS14-fr | STS15-fr | STS16-fr | SICK-fr | params |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| inokufu/flaubert-base-uncased-xnli-sts | 83.07 | 77.34 | 85.88 | 80.96 | 85.70 | 76.43 | 77.00 | 137M |
| Lajavaness/sentence-camembert-base | 82.92 | 77.71 | 84.19 | 81.83 | 87.04 | 76.81 | 76.36 | 110M |
| dangvantuan/sentence-camembert-base | 81.64 | 75.45 | 83.86 | 78.63 | 85.66 | 75.36 | 74.18 | 110M |
| sentence-transformers/distiluse-base-multilingual-cased-v2 | 77.49 | 69.80 | 68.85 | 68.17 | 80.27 | 70.04 | 72.49 | 135M |
| hugorosen/flaubert_base_uncased-xnli-sts | 76.93 | 68.96 | 77.62 | 71.87 | 79.33 | 72.86 | 73.91 | 137M |
| antoinelouis/biencoder-camembert-base-mmarcoFR | 75.55 | 66.89 | 73.90 | 67.14 | 78.78 | 72.64 | 72.03 | 110M |
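
For readers unfamiliar with the two metrics, the sketch below shows how Pearson and Spearman correlations between predicted similarities and gold labels can be computed. It is a plain-Python illustration, not the evaluator's actual implementation. The `gold` and `predicted` values are invented for the example, and the scores in the tables appear to be the correlations multiplied by 100.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented example data: gold labels in [0, 1] and model cosine similarities.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.10, 0.15, 0.40, 0.95, 0.90]

print(f"Pearson:  {100 * pearson(gold, predicted):.2f}")
print(f"Spearman: {100 * spearman(gold, predicted):.2f}")
```

Pearson rewards a linear relationship between predictions and labels, while Spearman only checks that the ordering of pairs is preserved, which is why the two tables can rank models differently.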

Citation

@article{reimers2019sentence,
   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
   author={Reimers, Nils and Gurevych, Iryna},
   journal={arXiv preprint arXiv:1908.10084},
   url={https://arxiv.org/abs/1908.10084},
   year={2019}
}


@article{martin2020camembert,
   title={CamemBERT: a Tasty French Language Model},
   author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
   journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
   year={2020}
}
@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv e-prints},
  pages={arXiv--2010},
  year={2020}