---
base_model: sentence-transformers/all-mpnet-base-v2
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:300000
- loss:CoSENTLoss
widget:
- source_sentence: SELECT DISTINCT count(alias3.col1) , alias1.col2 FROM table1 AS
alias1 JOIN table2 AS alias2 ON alias1.col2 = alias2.col2 JOIN table3 AS alias3
ON alias1.col1 = alias3.col1 WHERE alias2.col3 = str AND alias3.year = num GROUP
BY alias1.col2
sentences:
- SELECT col1 , avg(col2) FROM table1 WHERE col3 LIKE str GROUP BY col1
- SELECT col1 , col2 FROM table1 WHERE col3 LIKE str GROUP BY col1 ORDER BY count(*)
DESC LIMIT num
- SELECT col1 , avg(col2) FROM table1 GROUP BY col1 ORDER BY avg(col2)
- source_sentence: SELECT alias2.year FROM table1 AS alias1 JOIN table2 AS alias2
ON alias1.col1 = alias2.col2 WHERE alias1.alias1 = str
sentences:
- SELECT alias1.col1 , alias2.col2 FROM table1 AS alias1 JOIN table2 AS alias2 ON
alias1.col3 = alias2.col3
- SELECT DISTINCT alias1.col1 FROM table1 AS alias1 JOIN table2 AS alias2 ON alias2.col2
= alias1.col3 JOIN table3 AS alias3 ON alias2.col4 = alias3.col3 WHERE alias3.col5
> num
- SELECT col1 FROM table1 ORDER BY col2 LIMIT num
- source_sentence: SELECT DISTINCT count(alias2.col1) FROM table1 AS alias1 JOIN table2
AS alias2 ON alias1.col2 = alias2.col2 WHERE alias1.col3 = str
sentences:
- SELECT alias3.col1 FROM table1 AS alias1 JOIN table2 AS alias2 ON alias1.col2
= alias2.col2 JOIN table3 AS alias3 ON alias2.col3 = alias3.col3 WHERE alias1.col4
= str AND alias1.col5 = str
- SELECT count(DISTINCT col1) FROM table1 WHERE col1 NOT IN ( SELECT col2 FROM table2
)
- SELECT count(*) FROM table1 WHERE col1 = str AND col2 < num
- source_sentence: SELECT alias1.col1 FROM table1 AS alias1 JOIN table2 AS alias2
ON alias1.col2 = alias2.col2 WHERE alias2.col3 LIKE str
sentences:
- SELECT col1 FROM table1 ORDER BY col2 DESC
- SELECT col1 FROM table1 WHERE col2 NOT IN (SELECT col2 FROM table2)
- SELECT alias1.col1 , alias1.col2 , alias1.col3 FROM table1 AS alias1 JOIN table2
AS alias2 ON alias1.col4 = alias2.col5 ORDER BY alias2.col6 LIMIT num
- source_sentence: SELECT alias1.col1 FROM table1 AS alias1 JOIN table2 AS alias2
ON alias1.col2 = alias2.col2 JOIN table3 AS alias3 ON alias2.col3 = alias3.col3
WHERE alias3.col4 = str INTERSECT SELECT alias1.col1 FROM table1 AS alias1 JOIN
table2 AS alias2 ON alias1.col2 = alias2.col2 JOIN table3 AS alias3 ON alias2.col3
= alias3.col3 WHERE alias3.col4 = str
sentences:
- SELECT count(*) FROM table1
- SELECT count(DISTINCT col1) FROM table1
- SELECT count(col1) FROM table1 WHERE col2 = num
---
# SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- **Maximum Sequence Length:** 384 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
```
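With the library installed and the model downloaded (see the Usage section below), the figures listed under Model Description can be read directly off the loaded model. This is a small illustrative sketch, not part of the original card:

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned checkpoint (same model id as in the Usage section below).
model = SentenceTransformer("s2593817/sft-sql-embedding")

# Matches "Maximum Sequence Length: 384 tokens" above.
print(model.max_seq_length)                      # 384
# Matches "Output Dimensionality: 768 dimensions" above.
print(model.get_sentence_embedding_dimension())  # 768
# Prints the Transformer -> Pooling -> Normalize module stack shown above.
print(model)
```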
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("s2593817/sft-sql-embedding")
# Run inference
sentences = [
'SELECT alias1.col1 FROM table1 AS alias1 JOIN table2 AS alias2 ON alias1.col2 = alias2.col2 JOIN table3 AS alias3 ON alias2.col3 = alias3.col3 WHERE alias3.col4 = str INTERSECT SELECT alias1.col1 FROM table1 AS alias1 JOIN table2 AS alias2 ON alias1.col2 = alias2.col2 JOIN table3 AS alias3 ON alias2.col3 = alias3.col3 WHERE alias3.col4 = str',
'SELECT count(col1) FROM table1 WHERE col2 = num',
'SELECT count(DISTINCT col1) FROM table1',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
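For the SQL-similarity use case this model was trained on, a common pattern is to embed one query together with a pool of candidate queries and rank the candidates by cosine similarity. The sketch below reuses the `model` object from the snippet above; the candidate queries are made up for illustration:

```python
# Rank candidate SQL queries by similarity to a single query (illustrative example).
query = "SELECT count(*) FROM table1 WHERE col1 = str"
candidates = [
    "SELECT count(*) FROM table1",
    "SELECT col1 FROM table1 ORDER BY col2 DESC",
    "SELECT count(DISTINCT col1) FROM table1 WHERE col2 = str",
]

query_embedding = model.encode([query])
candidate_embeddings = model.encode(candidates)

# model.similarity applies the model's configured similarity function (cosine).
scores = model.similarity(query_embedding, candidate_embeddings)[0]
for candidate, score in sorted(zip(candidates, scores), key=lambda x: float(x[1]), reverse=True):
    print(f"{float(score):.3f}  {candidate}")
```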
## Training Details
### Training Dataset
#### Unnamed Dataset
* Size: 300,000 training samples
* Columns: sentence1
, sentence2
, and score
* Approximate statistics based on the first 1000 samples:
  |      | sentence1 | sentence2 | score |
  |:-----|:----------|:----------|:------|
  | type | string    | string    | float |
* Samples:
  | sentence1 | sentence2 | score |
  |:----------|:----------|:------|
  | <code>SELECT DISTINCT count(DISTINCT alias4.col1) , alias3.col2 FROM table1 AS alias1 JOIN table2 AS alias2 ON alias1.col3 = alias2.col3 JOIN table3 AS alias3 ON alias3.col4 = alias1.col4 JOIN table4 AS alias4 ON alias3.col4 = alias4.col5 WHERE alias2.col6 = str GROUP BY alias3.col2 ORDER BY count(DISTINCT alias4.col1) DESC</code> | <code>SELECT count(*) FROM table1 WHERE col1 = str</code> | <code>0.14221014492753623</code> |
  | <code>SELECT DISTINCT count(alias2.col1) FROM table1 AS alias1 JOIN table2 AS alias2 ON alias1.col2 = alias2.col2 WHERE alias1.col3 = str</code> | <code>SELECT alias3.col1 FROM table1 AS alias1 JOIN table2 AS alias2 ON alias1.col2 = alias2.col2 JOIN table3 AS alias3 ON alias2.col3 = alias3.col3 WHERE alias1.col4 = str AND alias1.col5 = str</code> | <code>0.5468686868686868</code> |
  | <code>SELECT count(*) FROM table1</code> | <code>SELECT count(*) FROM table1 WHERE col1 LIKE str</code> | <code>0.6269230769230769</code> |
* Loss: [CoSENTLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
```json
{
"scale": 20.0,
"similarity_fct": "pairwise_cos_sim"
}
```
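As a minimal sketch of how these loss parameters map onto the library API (assuming the base model from the card header; `scale=20.0` and pairwise cosine similarity are also the library defaults):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss
from sentence_transformers.util import pairwise_cos_sim

# Base model used for fine-tuning (from the card header).
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# CoSENTLoss configured with the parameters listed above.
loss = CoSENTLoss(model, scale=20.0, similarity_fct=pairwise_cos_sim)
```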
### Training Hyperparameters
#### Non-Default Hyperparameters
- `per_device_train_batch_size`: 160
- `learning_rate`: 2e-05
- `num_train_epochs`: 8
- `warmup_ratio`: 0.2
- `fp16`: True
- `dataloader_num_workers`: 16
- `batch_sampler`: no_duplicates
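A rough sketch of how these non-default values map onto the training API, reusing the `model` and `loss` objects from the CoSENTLoss sketch above; the training rows and `output_dir` are placeholders, not the actual 300,000-sample dataset:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Placeholder rows with the (sentence1, sentence2, score) schema described above.
train_dataset = Dataset.from_dict({
    "sentence1": ["SELECT count(*) FROM table1"],
    "sentence2": ["SELECT count(*) FROM table1 WHERE col1 LIKE str"],
    "score": [0.6269230769230769],
})

# The non-default hyperparameters listed above; output_dir is illustrative.
args = SentenceTransformerTrainingArguments(
    output_dir="sft-sql-embedding",
    per_device_train_batch_size=160,
    learning_rate=2e-5,
    num_train_epochs=8,
    warmup_ratio=0.2,
    fp16=True,
    dataloader_num_workers=16,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,          # from the CoSENTLoss sketch above
    args=args,
    train_dataset=train_dataset,
    loss=loss,            # CoSENTLoss from above
)
trainer.train()
```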
#### All Hyperparameters