	---
	pipeline_tag: sentence-similarity
	language: fr
	datasets:
	- stsb_multi_mt
	tags:
	- Text
	- Sentence Similarity
	- Sentence-Embedding
	- camembert-base
	license: apache-2.0
	model-index:
	- name: sentence-camembert-base by Van Tuan DANG
	  results:
	  - task:
	      name: Sentence-Embedding
	      type: Text Similarity
	    dataset:
	      name: Text Similarity fr
	      type: stsb_multi_mt
	      args: fr
	    metrics:
	    - name: Test Pearson correlation coefficient
	      type: Pearson_correlation_coefficient
	      value: 86.88
	library_name: sentence-transformers
	---

	## Pre-trained sentence embedding models are the state of the art for French sentence embeddings.
	This model is an improved version of [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base), fine-tuned with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on the [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) dataset, using pair sampling strategies with two models: [CrossEncoder-camembert-large](https://huggingface.co/dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large).
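The card does not spell out the training pipeline, so the following is only a rough, dependency-free sketch of the general Augmented SBERT idea from the linked paper: a bi-encoder proposes similar sentence pairs, and a cross-encoder assigns them "silver" similarity labels for further fine-tuning. Everything here is an illustrative assumption: the toy 2-dimensional vectors stand in for bi-encoder embeddings, `toy_cross_encoder_score` is a word-overlap stand-in for a real cross-encoder, and `sample_pairs` / `label_pairs` are hypothetical helper names, not the author's training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sample_pairs(embeddings, top_k=1):
    """Pair each sentence with its top_k most similar other sentences."""
    pairs = set()
    for i, u in enumerate(embeddings):
        scored = sorted(
            ((cosine(u, v), j) for j, v in enumerate(embeddings) if j != i),
            reverse=True,
        )
        for _, j in scored[:top_k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair once
    return sorted(pairs)

def toy_cross_encoder_score(a, b):
    """Stand-in scorer (Jaccard word overlap); a real cross-encoder would go here."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def label_pairs(pairs, sentences, score_fn):
    """Attach a silver label to every sampled pair."""
    return [(sentences[i], sentences[j], score_fn(sentences[i], sentences[j]))
            for i, j in pairs]

sentences = ["Un chat dort.", "Le chat dort.", "Il pleut fort.", "Il pleut beaucoup."]
# Made-up embeddings standing in for bi-encoder output
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

pairs = sample_pairs(toy_embeddings, top_k=1)
silver = label_pairs(pairs, sentences, toy_cross_encoder_score)
for row in silver:
    print(row)
```

In a real setup, the silver pairs would be combined with the gold stsb pairs to fine-tune the bi-encoder.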
	## Usage
	The model can be used directly (without a language model) as follows:

	```python
	from sentence_transformers import SentenceTransformer
	model = SentenceTransformer("Lajavaness/sentence-camembert-base")

	sentences = ["Un avion est en train de décoller.",
	"Un homme joue d'une grande flûte.",
	"Un homme étale du fromage râpé sur une pizza.",
	"Une personne jette un chat au plafond.",
	"Une personne est en train de plier un morceau de papier.",
	]

	embeddings = model.encode(sentences)
	```
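`model.encode` returns one vector per input sentence, and a common next step is to compare sentences by cosine similarity. The sketch below is a minimal, dependency-free illustration of that comparison. The short vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper written for this example, not part of sentence-transformers.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional stand-ins for rows of `embeddings`
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first vector
    [0.0, 3.0, 0.0],  # orthogonal to the first vector
]

sim_same = cosine_similarity(toy_embeddings[0], toy_embeddings[1])
sim_orth = cosine_similarity(toy_embeddings[0], toy_embeddings[2])
print(sim_same, sim_orth)
```

With real embeddings, `sentence_transformers.util.cos_sim(embeddings, embeddings)` computes the full pairwise similarity matrix in one call.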

	## Evaluation
	The model can be evaluated on the French dev and test splits of stsb as follows:

	```python
	from sentence_transformers import SentenceTransformer
	from sentence_transformers.readers import InputExample
	from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
	from datasets import load_dataset

	# Load the model to evaluate
	model = SentenceTransformer("Lajavaness/sentence-camembert-base")

	def convert_dataset(dataset):
	    """Convert STS rows into InputExample pairs with labels scaled to [0, 1]."""
	    dataset_samples = []
	    for row in dataset:
	        score = float(row['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
	        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
	        dataset_samples.append(inp_example)
	    return dataset_samples

	# Load the French dev and test splits for evaluation
	df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
	df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

	# Dev set: convert the rows, then evaluate
	dev_samples = convert_dataset(df_dev)
	val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
	val_evaluator(model, output_path="./")

	# Test set: convert the rows, then evaluate
	test_samples = convert_dataset(df_test)
	test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
	test_evaluator(model, output_path="./")
	```
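The evaluator compares the similarity scores predicted from the embeddings with the gold labels using Pearson and Spearman correlation. As a rough illustration of what these two metrics measure, here is a minimal pure-Python sketch. The gold and predicted scores are made-up numbers, and `pearson`, `ranks`, and `spearman` are hypothetical helpers written for this example. The result tables appear to report these correlations multiplied by 100.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up gold scores on the 0-5 STS scale, normalized to [0, 1] as in convert_dataset
gold = [s / 5.0 for s in [5.0, 3.0, 1.0, 4.0]]
# Made-up cosine similarities standing in for model predictions
predicted = [0.9, 0.5, 0.4, 0.8]

pearson_score = pearson(gold, predicted)
spearman_score = spearman(gold, predicted)
print(pearson_score, spearman_score)
```

In this toy case the predictions order the pairs exactly as the gold labels do, so the Spearman correlation is 1, while the Pearson correlation is slightly lower because the scores are not perfectly linear in the labels.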

	Test Result:
	The performance is measured using Pearson and Spearman correlation on the sts-benchmark:
	- On dev


	\| Model \| Pearson correlation \| Spearman correlation \| #params \|
	\| ------------- \| ------------- \| ------------- \|------------- \|
	\| [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base)\| 86.88 \|86.73 \| 110M \|
	\| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base)\| 86.73 \|86.54 \| 110M \|
	\| [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)\| 85.85 \|85.71 \| 137M \|
	\| [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) \| 79.22 \| 79.16\|135M \|


	- On test: Pearson and Spearman correlations are evaluated on several benchmark datasets:

	Pearson score
	\| Model \| [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) \| [STS12-fr ](https://huggingface.co/datasets/Lajavaness/STS12-fr)\| [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) \| [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) \| [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) \| [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) \| [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) \| params \|
	\|-----------------------------------------------------------\|---------\|----------\|----------\|----------\|----------\|----------\|---------\|--------\|
	\| [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) \| 83.46 \| 84.49 \| 84.61 \| 83.94 \| 86.94 \| 75.20 \| 82.86 \| 110M \|
	\| [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) \| 82.82 \| 84.79 \| 85.76 \| 82.81 \| 85.38 \| 74.05 \| 82.23 \| 137M \|
	\| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) \| 82.36 \| 82.06 \| 84.08 \| 81.51 \| 85.54 \| 73.97 \| 80.91 \| 110M \|
	\| [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased)\| 78.63 \| 72.51 \| 67.25 \| 70.12 \| 79.93 \| 66.67 \| 77.76 \| 135M \|
	\| [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) \| 78.38 \| 79.00 \| 77.61 \| 76.56 \| 79.03 \| 71.22 \| 80.58 \| 137M \|
	\| [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) \| 76.97 \| 71.43 \| 73.50 \| 70.56 \| 78.44 \| 71.23 \| 77.62 \| 110M \|


	Spearman score
	\| Model \| [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) \| [STS12-fr ](https://huggingface.co/datasets/Lajavaness/STS12-fr)\| [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) \| [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) \| [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) \| [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) \| [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) \| params \|
	\|-----------------------------------------------------------\|---------\|----------\|----------\|----------\|----------\|----------\|---------\|--------\|
	\| [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) \| 82.92 \| 77.71 \| 84.19 \| 81.83 \| 87.04 \| 76.81 \| 76.36 \| 110M \|
	\| [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) \| 83.07 \| 77.34 \| 85.88 \| 80.96 \| 85.70 \| 76.43 \| 77.00 \| 137M \|
	\| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) \| 81.64 \| 75.45 \| 83.86 \| 78.63 \| 85.66 \| 75.36 \| 74.18 \| 110M \|
	\| [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) \| 77.49 \| 69.80 \| 68.85 \| 68.17 \| 80.27 \| 70.04 \| 72.49 \| 135M \|
	\| [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) \| 76.93 \| 68.96 \| 77.62 \| 71.87 \| 79.33 \| 72.86 \| 73.91 \| 137M \|
	\| [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) \| 75.55 \| 66.89 \| 73.90 \| 67.14 \| 78.78 \| 72.64 \| 72.03 \| 110M \|


	## Citation


	@article{reimers2019sentence,
	title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
	author={Reimers, Nils and Gurevych, Iryna},
	journal={https://arxiv.org/abs/1908.10084},
	year={2019}
	}


	@article{martin2020camembert,
	title={CamemBERT: a Tasty French Language Model},
	author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
	journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
	year={2020}
	}
	@article{thakur2020augmented,
	title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
	author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
	journal={arXiv e-prints},
	pages={arXiv--2010},
	year={2020}