lodestone-base-4096-v1

Hum-Works/lodestone-base-4096-v1. Griffin McCauley, Will Fortin, Dylan DiGioia 2023

This new sentence-transformers model from Hum maps long sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

Abstract

To further Hum's overarching mission of increasing the accessibility and interconnectivity of human knowledge, this model was developed as part of a project to extend the maximum input sequence length of sentence embedding models. It does so by leveraging recent architectural advances in transformer design, namely FlashAttention, Attention with Linear Biases (ALiBi), and Gated Linear Units (GLU). These modifications were implemented by the team at MosaicML, who designed and built the pre-trained mosaic-bert-base-seqlen-2048 model; details of its development and testing can be found on its model card.

While the fine-tuning procedure followed during this project loosely mirrors that of the original Flax-sentence-embeddings team responsible for many other popular sentence-transformers models (e.g. all-mpnet-base-v2, all-distilroberta-v1, and all-MiniLM-L6-v2), our methodology includes novel techniques for data loading, batch sampling, and model checkpointing intended to improve training efficiency with regard to memory allocation and data storage.

By combining these well-established fine-tuning practices with recent advances in transformer architecture, our lodestone-base-4096-v1 model achieves comparable performance on standard text embedding evaluation benchmarks while supporting a longer input sequence length of 4096 tokens and retaining a smaller, more manageable size that can run on either a GPU or a CPU.

Usage

Using this model is relatively easy once sentence-transformers is installed. At the time of publishing, sentence-transformers does not support remote code, which is required for the FlashAttention implementation used by the model. A fork of the sentence-transformers repository that allows remote code execution is provided for convenience. It can be installed using the following commands:

pip install git+https://github.com/Hum-Works/sentence-transformers.git
pip install einops

Then you can use the model like this:

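from sentence_transformers import SentenceTransformer

# trust_remote_code=True is required because the model relies on remote code.
model = SentenceTransformer('Hum-Works/lodestone-base-4096-v1', trust_remote_code=True, revision='v1.0.0')
sentences = ["This is an example sentence", "Each sentence is converted"]
# encode() returns one 768-dimensional embedding per input sentence.
embeddings = model.encode(sentences)
print(embeddings)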

Note: The model will use the OpenAI Triton implementation of FlashAttention if it is installed. This is more performant than the fallback PyTorch implementation. Some platforms and GPUs may not be supported by Triton; up-to-date compatibility information can be found on Triton’s GitHub page.

Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We used the pretrained mosaic-bert-base-seqlen-2048 model and fine-tuned it on a dataset of nearly 1.5B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.

Intended uses

Our model is intended to be used as a long sentence and paragraph encoder. Given an input text, it outputs a vector containing the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.
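As a rough illustration of the retrieval use case, the sketch below ranks candidate texts against a query by dot product. Because the model ends in a Normalize() layer (see Full Model Architecture), its embeddings are unit length, so the dot product equals cosine similarity. The 3-dimensional vectors are toy stand-ins for real 768-dimensional embeddings, and the helper names normalize and rank_by_similarity are hypothetical, introduced only for this example.

```python
import math

def normalize(vector):
    """Scale a vector to unit length, mirroring the model's Normalize() layer."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def rank_by_similarity(query, corpus):
    """Return corpus indices sorted by descending dot product with the query."""
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

# Toy 3-d vectors standing in for model.encode(...) outputs (assumption).
query = normalize([1.0, 0.2, 0.0])
corpus = [normalize(v) for v in ([0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0])]

ranking = rank_by_similarity(query, corpus)
print(ranking)  # index of the closest corpus entry comes first
```

In practice, query and corpus would come from model.encode(), and a vector index would typically replace the linear scan for large corpora.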

Training procedure

Pre-training

We use the pretrained mosaic-bert-base-seqlen-2048. Please refer to the model card for more detailed information about the pre-training procedure.

Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the dot product of each possible sentence pairing in the batch. We then apply a cross-entropy loss over these scores, treating the true pair as the correct class.
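The objective described above can be sketched in plain Python as follows. This is a minimal illustration rather than the training code from this repository: the function name in_batch_contrastive_loss is hypothetical, and the scale multiplier applied to the scores is an assumed value that the text does not specify.

```python
import math

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over in-batch dot-product scores.

    anchors[i] and positives[i] form a true pair; every positives[j] with
    j != i acts as a negative for anchors[i]. `scale` is an assumed
    multiplier, not a value taken from the model card.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Dot product of this anchor with every candidate in the batch.
        scores = [scale * sum(a * p for a, p in zip(anchor, pos)) for pos in positives]
        # Numerically stable log-sum-exp over the candidates.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        # Cross-entropy with the true pair (index i) as the target class.
        total += log_z - scores[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # true pairs score highest
swapped = [[0.0, 1.0], [1.0, 0.0]]   # true pairs score lowest

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor scores highest against its own partner and grows when another sentence in the batch scores higher.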

Hyperparameters

We trained our model on an ml.g5.4xlarge EC2 instance with 1 NVIDIA A10G Tensor Core GPU. We trained the model for 1.4 million steps using a batch size of 16, with a learning rate warm-up of 500 steps. The sequence length during training was limited to 2048 tokens. We used the AdamW optimizer with a 2e-5 learning rate and weight decay of 0.01 (i.e. the default parameter values for SentenceTransformer.fit()). The full training script is available in this repository: Training.py.
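To put these figures in context, the arithmetic below estimates how many training tuples the run consumed. It takes the stated step count and batch size at face value and assumes each step processes exactly one batch of 16 tuples with no gradient accumulation, which this section does not confirm. The dataset total is the figure from the Training data table.

```python
steps = 1_400_000             # training steps stated above
batch_size = 16               # batch size stated above
total_pairs = 1_492_453_113   # "Total" row of the training data table

# Assumption: one batch of `batch_size` tuples per step, no accumulation.
examples_seen = steps * batch_size
fraction_of_dataset = examples_seen / total_pairs

print(f"{examples_seen:,} tuples seen, {fraction_of_dataset:.2%} of the dataset")
```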

Model Architecture

By incorporating FlashAttention, Attention with Linear Biases (ALiBi), and Gated Linear Units (GLU), this model is able to handle input sequences of 4096 tokens, 8x longer than that supported by most comparable sentence embedding models. The model was trained with a maximum sequence length of 2048, but the final model has a maximum sequence length of 4096. This is accomplished by taking advantage of ALiBi’s positional attention extrapolation, which has been shown to allow sequence lengths of 2x the initial trained length.
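The sketch below illustrates why ALiBi can extrapolate: the bias added to each attention score depends only on the distance between two positions, so the same rule applies unchanged to sequences longer than those seen in training. It is a simplified illustration, not the MosaicML implementation. It assumes a power-of-two head count (8 here, chosen for simplicity; this document does not state the model's head count) and a symmetric, encoder-style distance penalty. Both function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Per-head slopes as a geometric sequence (assumes n_heads is a power of two)."""
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (head + 1) for head in range(n_heads)]

def symmetric_alibi_bias(seq_len, slope):
    """Bias matrix with entry -slope * |i - j| (symmetric form is an assumption)."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
short_bias = symmetric_alibi_bias(4, slopes[0])
long_bias = symmetric_alibi_bias(8, slopes[0])

# The penalty for a given distance is identical regardless of sequence length.
print(short_bias[0][3], long_bias[0][3])
```

Since no learned position embeddings are tied to a fixed length, extending the maximum sequence length only requires building a larger bias matrix with the same slopes.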

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 4096, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)

Training data

We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is nearly 1.5 billion. We sampled each dataset with a weighted probability proportional to its relative contribution to the entire dataset. The breakdown of the dataset can be seen below, and the entire dataset can be publicly accessed and uploaded via the Dataloading.ipynb notebook located within this repository.

Dataset	Paper	Number of training tuples
Reddit comments (2015-2018)	paper	726,484,430
S2ORC Citation pairs (Abstracts)	paper	252,102,397
Reddit posts (Title, Body) pairs	-	127,445,911
Amazon reviews (2018) (Title, Review) pairs	-	87,877,725
WikiAnswers Duplicate question pairs	paper	77,427,422
PAQ (Question, Answer) pairs	paper	64,371,441
S2ORC Citation pairs (Titles)	paper	52,603,982
S2ORC (Title, Abstract)	paper	41,769,185
Stack Exchange (Title, Body) pairs	-	25,368,423
MS MARCO triplets	paper	9,144,553
Stack Exchange (Title, Most Upvoted Answer) pairs	-	4,784,250
Stack Exchange (Title+Body, Most Upvoted Answer) pairs	-	4,551,660
GOOAQ: Open Question Answering with Diverse Answer Types	paper	3,012,496
Amazon QA	-	2,507,114
Code Search	-	1,375,067
Yahoo Answers (Title, Answer)	paper	1,198,260
AG News (Title, Description) pairs	-	1,157,745
COCO Image captions	paper	828,395
SPECTER citation triplets	paper	684,100
Yahoo Answers (Question, Answer)	paper	681,164
Yahoo Answers (Title, Question)	paper	659,896
CC News (Title, article) pairs	-	614,664
NPR (Title, Body) pairs	-	594,384
SearchQA	paper	582,261
MS Marco (Query, Answer Passage) pairs	paper	532,751
Stack Exchange (Title, Body) pairs	-	364,000
Eli5	paper	325,475
Flickr 30k	paper	317,695
CNN & DailyMail (highlight sentences, article) pairs	-	311,971
Stack Exchange Duplicate questions (titles)	-	304,524
AllNLI (SNLI and MultiNLI)	paper SNLI, paper MultiNLI	277,230
Stack Exchange Duplicate questions (bodies)	-	250,518
Stack Exchange Duplicate questions (titles+bodies)	-	250,459
XSUM (Summary, News Article) pairs	-	226,711
Stack Exchange (Title+Body, Most Upvoted Answer, Most Downvoted Answer) triplets	-	216,454
Sentence Compression	paper	180,000
FEVER training data	-	139,051
Wikihow	paper	128,542
SearchQA (Question, Top-Snippet)	paper	117,384
Altlex	paper	112,696
Quora Question Duplicates	-	103,663
Quora Question Triplets	-	103,663
Simple Wikipedia	paper	102,225
Natural Questions (NQ)	paper	100,231
SQuAD2.0	paper	87,599
TriviaQA	-	73,346
Total		1,492,453,113
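
The size-proportional sampling described at the start of this section can be sketched as follows. This is an illustration under stated assumptions, not the batch sampler from the training script: only three rows of the table are included for brevity, so the resulting weights differ from those over the full dataset, and the fixed seed is used only to make the example reproducible.

```python
import random

# Tuple counts copied from three rows of the table above (the rest are omitted).
dataset_sizes = {
    "Reddit comments (2015-2018)": 726_484_430,
    "S2ORC Citation pairs (Abstracts)": 252_102_397,
    "TriviaQA": 73_346,
}

# Each dataset's weight is its share of the combined tuple count.
subset_total = sum(dataset_sizes.values())
weights = {name: size / subset_total for name, size in dataset_sizes.items()}

# Draw the source dataset for 10,000 samples according to those weights.
rng = random.Random(0)
names = list(dataset_sizes)
draws = rng.choices(names, weights=[weights[n] for n in names], k=10_000)

reddit_share = draws.count("Reddit comments (2015-2018)") / len(draws)
print(f"Reddit comments share of draws: {reddit_share:.1%}")
```

Under this scheme, large sources dominate the sampled stream, while small sources such as TriviaQA appear only rarely.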

Replication

The entire fine-tuning process for this model can be replicated by following the steps outlined in the Replication.txt file within this repository. This document explains how to modify the sentence-transformers library, configure the pre-trained mosaic-bert-base-seqlen-2048 model, load all of the training data, and execute the training script.

Limitations

Due to technical constraints (e.g. limited GPU memory capacity), this model was trained with a small batch size of 16, so each training step was informed by fewer examples than it would have been on a higher-performance system. A batch size this small generally makes the model more likely to get stuck in a local minimum and slows the convergence of its parameters toward the optimum. To counteract this risk, we trained the model for a larger number of steps than many of its contemporaries to improve the chance of achieving strong performance, but this is an area that could be improved if further fine-tuning were performed.

It is also worth noting that, while this model is able to handle longer input sequences of up to 4096 word pieces, the training dataset used consists of sentence and paragraph pairs and triplets which do not necessarily reach that maximum sequence length. Since the data was not tailored specifically for this larger input size, further fine-tuning may be required to ensure highly accurate embeddings for longer texts of that magnitude.

Finally, as stated on https://huggingface.co/datasets/sentence-transformers/reddit-title-body, the Reddit posts data comes with a reminder and warning that one should "Be aware that this dataset is not filtered for biases, hate-speech, spam, racial slurs etc. It depicts the content as it is posted on Reddit." We believe this has not induced any pathological behaviors in the model's performance, given the relatively low prevalence of these records in the whole dataset of nearly 1.5B sentence pairs and the fact that this model was trained to produce semantic embeddings rather than generative text outputs. Nonetheless, it is always important to be aware of vulnerabilities to bias.
