metadata

pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
language: en
license: apache-2.0
datasets:
  - s2orc
  - flax-sentence-embeddings/stackexchange_xml
  - ms_marco
  - gooaq
  - yahoo_answers_topics
  - code_search_net
  - search_qa
  - eli5
  - snli
  - multi_nli
  - wikihow
  - natural_questions
  - trivia_qa
  - embedding-data/sentence-compression
  - embedding-data/flickr30k-captions
  - embedding-data/altlex
  - embedding-data/simple-wiki
  - embedding-data/QQP
  - embedding-data/SPECTER
  - embedding-data/PAQ_pairs
  - embedding-data/WikiAnswers

Sentence Transformers

This model is a fork of sentence-transformers/all-MiniLM-L6-v2, chosen because its training data and intended use are close to our target dataset and use case. For more details, see the repository of the original pre-trained model weights.

Fine-tuning

The model is fine-tuned with a contrastive objective.
For each batch, the cosine similarity is computed between every possible pair of sentences.
A cross-entropy loss is then applied, with the true pairs as targets.
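
The objective described above can be sketched in plain Python. This is a minimal illustration, not the code in train_script.py: the function names, the `scale` factor of 20, and the toy 2-dimensional vectors are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over the batch similarity matrix.

    Row i holds the similarities between anchors[i] and every entry of
    positives; the true pair sits in column i, the rest act as negatives.
    `scale` is a hypothetical temperature-like factor, not taken from the card.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Cross-entropy with target class i.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is close to zero when each anchor is most similar to its own partner and grows when the pairs are mismatched, which is what pushes true pairs together and other in-batch sentences apart.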

Hyperparameters

The model was trained for 100k steps using a batch size of 1024 (128 per TPU core).
The learning rate was warmed up over the first 500 steps.
The sequence length was limited to 128 tokens.
The AdamW optimizer was used with a learning rate of 2e-5.
The full training script is available in this repository: train_script.py.
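
The settings above can be collected into a small configuration sketch. The constant names and the `learning_rate` helper are hypothetical, and the card does not say how the rate behaves after warm-up: the sketch assumes a linear warm-up followed by a constant rate, whereas train_script.py may decay it.

```python
# Values taken from the text above; the names are illustrative only.
TOTAL_STEPS = 100_000
GLOBAL_BATCH_SIZE = 1024
PER_CORE_BATCH_SIZE = 128
WARMUP_STEPS = 500
MAX_SEQ_LENGTH = 128
PEAK_LR = 2e-5

# Number of TPU cores implied by the global and per-core batch sizes.
num_cores = GLOBAL_BATCH_SIZE // PER_CORE_BATCH_SIZE

# Total number of training tuples drawn over the whole run.
tuples_seen = TOTAL_STEPS * GLOBAL_BATCH_SIZE


def learning_rate(step):
    """Learning rate at a 0-indexed optimizer step.

    Linear warm-up to PEAK_LR over WARMUP_STEPS, then constant.
    The constant phase is an assumption, not stated in the card.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


print(num_cores, tuples_seen, learning_rate(0), learning_rate(WARMUP_STEPS))
```

Under these values the global batch is split across 8 TPU cores and the run draws 102,400,000 training tuples in total.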

Datasets

Dataset	Paper	Number of training tuples
Reddit comments (2015-2018)	paper	726,484,430
S2ORC Citation pairs (Abstracts)	paper	116,288,806
WikiAnswers Duplicate question pairs	paper	77,427,422
PAQ (Question, Answer) pairs	paper	64,371,441
S2ORC Citation pairs (Titles)	paper	52,603,982
S2ORC (Title, Abstract)	paper	41,769,185
Stack Exchange (Title, Body) pairs	-	25,316,456
Stack Exchange (Title+Body, Answer) pairs	-	21,396,559
Stack Exchange (Title, Answer) pairs	-	21,396,559
MS MARCO triplets	paper	9,144,553
GOOAQ: Open Question Answering with Diverse Answer Types	paper	3,012,496
Yahoo Answers (Title, Answer)	paper	1,198,260
Code Search	-	1,151,414
COCO Image captions	paper	828,395
SPECTER citation triplets	paper	684,100
Yahoo Answers (Question, Answer)	paper	681,164
Yahoo Answers (Title, Question)	paper	659,896
SearchQA	paper	582,261
Eli5	paper	325,475
Flickr 30k	paper	317,695
Stack Exchange Duplicate questions (titles)	-	304,525
AllNLI (SNLI and MultiNLI)	paper SNLI, paper MultiNLI	277,230
Stack Exchange Duplicate questions (bodies)	-	250,519
Stack Exchange Duplicate questions (titles+bodies)	-	250,460
Sentence Compression	paper	180,000
Wikihow	paper	128,542
Altlex	paper	112,696
Quora Question Triplets	-	103,663
Simple Wikipedia	paper	102,225
Natural Questions (NQ)	paper	100,231
SQuAD2.0	paper	87,599
TriviaQA	-	73,346
Total		1,170,060,424