lodestone-base-4096-v1
Hum-Works/lodestone-base-4096-v1. Griffin McCauley, Will Fortin, Dylan DiGioia 2023
This new sentence-transformers model from Hum maps long sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
Abstract
In the hopes of furthering Hum's overarching mission of increasing the accessibility and interconnectivity of human knowledge, this model was developed as part of a project intending to boost the maximum input sequence length of sentence embedding models by leveraging recent architectural advances in the design of transformer models, such as the incorporation of FlashAttention, Attention with Linear Biases (ALiBi), and Gated Linear Units (GLU). These modifications and enhancements were implemented by the team at MosaicML who designed and constructed the pre-trained mosaic-bert-base-seqlen-2048 model, and more information regarding the details of their development and testing specifications can be found on the model card.
While the fine-tuning procedure followed during the course of this project loosely mirrors that of the original Flax-sentence-embeddings team responsible for the creation of many other popular sentence-transformers models (e.g. all-mpnet-base-v2, all-distilroberta-v1, and all-MiniLM-L6-v2), our methodology includes novel techniques for data loading, batch sampling, and model checkpointing intended to improve training efficiency with regard to memory allocation and data storage.
By combining these well-established and proven fine-tuning practices with novel advances in transformer architectural elements, our lodestone-base-4096-v1 model is able to achieve comparable performance on standard text embedding evaluation benchmarks while supporting a longer and more robust input sequence length of 4096 tokens and retaining a smaller, more manageable size that can be run on either a GPU or a CPU.
Usage
Using this model is relatively easy once you have sentence-transformers installed. At the time of publishing, sentence-transformers does not support remote code, which is required for the FlashAttention implementation used by the model. A fork of the sentence-transformers repository that allows remote code execution is provided for convenience. It can be installed, together with einops, using the following commands:
pip install git+https://github.com/Hum-Works/sentence-transformers.git
pip install einops
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('Hum-Works/lodestone-base-4096-v1', trust_remote_code=True, revision='v1.0.0')
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
Note: The model will use the OpenAI Triton implementation of FlashAttention if it is installed. This is more performant than the fallback torch implementation. Some platforms and GPUs may not be supported by Triton; up-to-date compatibility information can be found on Triton’s GitHub page.
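For semantic search, the embeddings returned by `model.encode(...)` can be compared directly. The following is a minimal sketch, not part of the original repository: it uses small hand-written vectors as stand-ins for real embeddings so that it runs without downloading the model, and the helper names `l2_normalize` and `rank_by_similarity` are hypothetical. It relies on the fact that, for unit-length vectors, the dot product equals cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_similarity(query_emb, corpus_embs):
    """Return (index, score) pairs sorted by descending dot product.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = [sum(q * c for q, c in zip(query_emb, emb)) for emb in corpus_embs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional stand-ins for the 768-dimensional output of model.encode(...).
corpus_embs = [
    l2_normalize(v)
    for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])
]
query_emb = l2_normalize([0.9, 0.1, 0.0])

ranking = rank_by_similarity(query_emb, corpus_embs)
for index, score in ranking:
    print(f"corpus item {index}: similarity {score:.3f}")
```

With real data, `corpus_embs` and `query_emb` would be replaced by rows of the array returned by `model.encode(...)`.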
Background
The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We took the pretrained mosaic-bert-base-seqlen-2048 model and fine-tuned it on a dataset of nearly 1.5B sentence pairs. The contrastive objective works as follows: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
Intended uses
Our model is intended to be used as a long sentence and paragraph encoder. Given an input text, it outputs a vector containing the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.
Training procedure
Pre-training
We use the pretrained mosaic-bert-base-seqlen-2048 model. Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective. Formally, we compute the dot product of each possible sentence pairing in the batch. We then apply the cross entropy loss by comparing with true pairs.
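The objective described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code from Training.py: the function name `in_batch_contrastive_loss` is hypothetical, and the `scale` multiplier is an assumed knob that the text does not mention (it defaults to 1.0, i.e. raw dot products).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(anchors, positives, scale=1.0):
    """Mean cross-entropy over a batch where positives[i] is the true pair of anchors[i].

    `scale` is a hypothetical multiplier on the scores; the source does not state one.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Row i of the score matrix: anchor i against every candidate in the batch.
        logits = [scale * dot(anchor, cand) for cand in positives]
        # Numerically stable log-sum-exp over the row.
        row_max = max(logits)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in logits))
        # Cross-entropy with the true pair (the diagonal entry) as the target class.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor is closest to its own pair
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor is closest to the wrong pair

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is lower when each sentence scores highest against its true partner, which is what pushes paired sentences together and the other in-batch sentences apart.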
Hyperparameters
We trained our model on an ml.g5.4xlarge EC2 instance with 1 NVIDIA A10G Tensor Core GPU. We trained the model for 1.4 million steps using a batch size of 16, with a learning rate warm-up of 500 steps. The sequence length during training was limited to 2048 tokens. We used the AdamW optimizer with a 2e-5 learning rate and weight decay of 0.01 (i.e. the default parameter values for SentenceTransformer.fit()). The full training script is accessible in this repository: Training.py.
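To put these numbers in context, the short calculation below estimates how many training tuples the model saw. It is a rough sketch that assumes one training step consumes exactly one batch of 16 tuples (no gradient accumulation) and treats "1.4 million" as an exact figure, neither of which the text confirms.

```python
# Values copied from the Hyperparameters and Training data sections.
train_steps = 1_400_000      # "1.4 million steps" (rounded figure from the text)
batch_size = 16
total_pairs = 1_492_453_113  # total from the training data table

# Assumption: one step processes one batch, with no gradient accumulation.
pairs_seen = train_steps * batch_size
fraction_of_dataset = pairs_seen / total_pairs

print(f"tuples seen during training: {pairs_seen:,}")
print(f"fraction of the full dataset: {fraction_of_dataset:.2%}")
```

Under these assumptions, the run covers only a small sample of the full collection rather than a complete pass over it.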
Model Architecture
By incorporating FlashAttention, Attention with Linear Biases (ALiBi), and Gated Linear Units (GLU), this model is able to handle input sequences of 4096 tokens, 8x longer than that supported by most comparable sentence embedding models. The model was trained using a maximum sequence length of 2048, but the final model has a maximum sequence length of 4096. This is accomplished by taking advantage of ALiBi’s positional attention extrapolation, which has been shown to allow sequence lengths of 2x the initial trained length.
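The reason ALiBi can extrapolate is that it has no learned position embeddings: it adds a fixed penalty to each attention score that grows linearly with the distance between tokens. The sketch below illustrates the idea and is not taken from the MosaicBERT source. It assumes a power-of-two head count (8 here, chosen for illustration) and a symmetric penalty suited to a bidirectional encoder; the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence, as in the ALiBi paper.

    This sketch assumes num_heads is a power of two.
    """
    start = 2 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Symmetric bias matrix: -slope * |i - j| is added to attention score (i, j)."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)           # 8 heads is an illustrative choice
short = alibi_bias(4, slopes[0])
longer = alibi_bias(8, slopes[0])  # twice the length, same formula, no new parameters

# The bias for a given token distance is identical at both lengths.
print(short[0][3], longer[0][3])
# Positions beyond the shorter length simply receive a larger penalty.
print(longer[0][7])
```

Because the bias depends only on relative distance, running the model at a longer sequence length requires no additional learned parameters.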
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 4096, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
(2): Normalize()
)
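The Pooling and Normalize stages listed above (mean-token pooling followed by L2 normalization) can be illustrated with a small sketch. It uses toy 2-dimensional token vectors rather than the real 768-dimensional ones, and the helper names are hypothetical rather than the sentence-transformers internals.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Three token positions; the last one is padding and must not affect the result.
token_embeddings = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)  # [1.5, 2.0]
sentence_embedding = l2_normalize(pooled)             # [0.6, 0.8]
print(pooled, sentence_embedding)
```

Since every output vector has unit length, the dot product between two sentence embeddings is their cosine similarity.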
Training data
We use a concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is nearly 1.5 billion. We sampled each dataset with a weighted probability proportional to its relative contribution to the entire dataset.
The breakdown of the dataset can be seen below, and the entire dataset can be publicly accessed and uploaded via the Dataloading.ipynb notebook located within this repository.
Dataset | Paper | Number of training tuples |
---|---|---|
Reddit comments (2015-2018) | paper | 726,484,430 |
S2ORC Citation pairs (Abstracts) | paper | 252,102,397 |
Reddit posts (Title, Body) pairs | - | 127,445,911 |
Amazon reviews (2018) (Title, Review) pairs | - | 87,877,725 |
WikiAnswers Duplicate question pairs | paper | 77,427,422 |
PAQ (Question, Answer) pairs | paper | 64,371,441 |
S2ORC Citation pairs (Titles) | paper | 52,603,982 |
S2ORC (Title, Abstract) | paper | 41,769,185 |
Stack Exchange (Title, Body) pairs | - | 25,368,423 |
MS MARCO triplets | paper | 9,144,553 |
Stack Exchange (Title, Most Upvoted Answer) pairs | - | 4,784,250 |
Stack Exchange (Title+Body, Most Upvoted Answer) pairs | - | 4,551,660 |
GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
Amazon QA | - | 2,507,114 |
Code Search | - | 1,375,067 |
Yahoo Answers (Title, Answer) | paper | 1,198,260 |
AG News (Title, Description) pairs of news articles from the AG News dataset | - | 1,157,745 |
COCO Image captions | paper | 828,395 |
SPECTER citation triplets | paper | 684,100 |
Yahoo Answers (Question, Answer) | paper | 681,164 |
Yahoo Answers (Title, Question) | paper | 659,896 |
CC News (Title, article) pairs | - | 614,664 |
NPR (Title, Body) pairs | - | 594,384 |
SearchQA | paper | 582,261 |
MS Marco (Query, Answer Passage) pairs | paper | 532,751 |
Stack Exchange (Title, Body) pairs | - | 364,000 |
Eli5 | paper | 325,475 |
Flickr 30k | paper | 317,695 |
CNN & DailyMail (highlight sentences, article) pairs | - | 311,971 |
Stack Exchange Duplicate questions (titles) | - | 304,524 |
AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
Stack Exchange Duplicate questions (bodies) | - | 250,518 |
Stack Exchange Duplicate questions (titles+bodies) | - | 250,459 |
XSUM (Summary, News Article) pairs | - | 226,711 |
Stack Exchange (Title+Body, Most Upvoted Answer, Most Downvoted Answer) triplets | - | 216,454 |
Sentence Compression | paper | 180,000 |
FEVER training data | - | 139,051 |
Wikihow | paper | 128,542 |
SearchQA (Question, Top-Snippet) | paper | 117,384 |
Altlex | paper | 112,696 |
Quora Question Duplicates | - | 103,663 |
Quora Question Triplets | - | 103,663 |
Simple Wikipedia | paper | 102,225 |
Natural Questions (NQ) | paper | 100,231 |
SQuAD2.0 | paper | 87,599 |
TriviaQA | - | 73,346 |
Total | | 1,492,453,113 |
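Sampling each dataset with probability proportional to its size can be sketched with the standard library. This is an illustration only, not the batch sampler from the repository: it uses just four rows of the table, so the probabilities are relative to that subset, and the function name `sample_dataset` is hypothetical.

```python
import random

# A few rows copied from the table; the remaining datasets are omitted for brevity.
dataset_sizes = {
    "Reddit comments (2015-2018)": 726_484_430,
    "S2ORC Citation pairs (Abstracts)": 252_102_397,
    "WikiAnswers Duplicate question pairs": 77_427_422,
    "TriviaQA": 73_346,
}

# Probability of each dataset, proportional to its share of this subset.
total = sum(dataset_sizes.values())
sampling_probs = {name: size / total for name, size in dataset_sizes.items()}


def sample_dataset(rng):
    """Pick the dataset that the next batch is drawn from, weighted by size."""
    names = list(dataset_sizes)
    weights = [dataset_sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample_dataset(rng) for _ in range(10_000)]
for name, prob in sampling_probs.items():
    observed = draws.count(name) / len(draws)
    print(f"{name}: expected {prob:.4f}, observed {observed:.4f}")
```

One consequence of size-proportional sampling is that the largest sources dominate training, while the smallest are drawn from only rarely.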
Replication
The entire fine-tuning process for this model can be replicated by following the steps outlined in the Replication.txt file within this repository. This document explains how to modify the sentence-transformers library, configure the pre-trained mosaic-bert-base-seqlen-2048 model, load all of the training data, and execute the training script.
Limitations
Due to technical constraints (e.g. limited GPU memory capacity), this model was trained with a smaller batch size of 16, so each step during training was less well-informed than it would have been on a higher-performance system. This smaller-than-ideal hyperparameter value will generally make the model more likely to get stuck in a local minimum and cause the parameter configuration to take longer to converge to the optimum. To counteract this potential risk, we trained the model for a larger number of steps than many of its contemporaries to ensure a greater chance of achieving strong performance, but this is an area which could be improved if further fine-tuning were performed.
It is also worth noting that, while this model is able to handle longer input sequences of up to 4096 word pieces, the training dataset used consists of sentence and paragraph pairs and triplets which do not necessarily reach that maximum sequence length. Since the data was not tailored specifically for this larger input size, further fine-tuning may be required to ensure highly accurate embeddings for longer texts of that magnitude.
Finally, as stated on https://huggingface.co/datasets/sentence-transformers/reddit-title-body, an additional reminder and warning regarding the Reddit posts data is that one should "Be aware that this dataset is not filtered for biases, hate-speech, spam, racial slurs etc. It depicts the content as it is posted on Reddit." Thus, while we believe this has not induced any pathological behaviors in the model's performance due to its relatively low prevalence of records in the whole dataset of nearly 1.5B sentence pairs and the fact that this model was trained to produce semantic embeddings rather than generative text outputs, it is always important to be aware of vulnerabilities to bias.
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 69.731
- ap on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 31.618
- f1 on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 63.303
- accuracy on MTEB AmazonPolarityClassification, test set (self-reported): 86.898
- ap on MTEB AmazonPolarityClassification, test set (self-reported): 82.395
- f1 on MTEB AmazonPolarityClassification, test set (self-reported): 86.873
- accuracy on MTEB AmazonReviewsClassification (en), test set (self-reported): 44.050
- f1 on MTEB AmazonReviewsClassification (en), test set (self-reported): 42.676
- map_at_1 on MTEB ArguAna, test set (self-reported): 26.174
- map_at_10 on MTEB ArguAna, test set (self-reported): 40.976