SentenceTransformer

This is a finetuned version of bge-m3 for the task of SQL table retrieval and ranking.

Model Details

This model identifies the SQL tables that are relevant to a natural language query, as a retrieval step for query-to-SQL translation.

The model was finetuned on a curated dataset of SQL table definitions paired with corresponding natural language queries, using the Flag Embeddings finetuning script.

Model Description

Model Type: Sentence Transformer
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity
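
As a reminder of what the similarity function computes, here is a minimal pure-Python sketch of cosine similarity. The toy 3-dimensional vectors stand in for the model's 1024-dimensional embeddings, and the helper name `cosine_similarity` is illustrative, not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors (made up for illustration, not real model output).
query = [1.0, 0.0, 0.0]
table_a = [2.0, 0.0, 0.0]  # same direction as the query
table_b = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, table_a))  # 1.0
print(cosine_similarity(query, table_b))  # 0.0
```

Because the score depends only on direction, scaling a vector does not change it.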

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
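
The last two modules can be illustrated with a small pure-Python sketch: with `pooling_mode_cls_token` enabled, the sentence embedding is taken from the first token's vector, and `Normalize()` scales it to unit length. This is a simplified illustration on made-up 2-dimensional token vectors, not the library's implementation; the helper names `cls_pool` and `l2_normalize` are hypothetical.

```python
import math

def cls_pool(token_embeddings):
    """Return the vector of the first token, which CLS pooling uses as the sentence vector."""
    return list(token_embeddings[0])

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Made-up token vectors for a three-token input; the real model uses 1024 dimensions.
token_embeddings = [
    [3.0, 4.0],  # first token, used by CLS pooling
    [1.0, 1.0],
    [0.0, 2.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.8]
```

Since the output has unit length, the dot product of two such embeddings equals their cosine similarity.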

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (replace the placeholder with this model's Hub ID, e.g. "RaduGabriel/BGE-M3-SQL")
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'What is the total biomass of fish in farms with a water temperature above 25 degrees Celsius?',
    'CREATE TABLE Farm (FarmID INT, FarmName VARCHAR(50), WaterTemperature DECIMAL, Biomass DECIMAL)',
    'CREATE TABLE Locations (id INT PRIMARY KEY, name VARCHAR(50), region VARCHAR(50), depth INT)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
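
For table retrieval, the scores between the query and each table definition are what matter. The following is a minimal sketch of turning those scores into a ranking; `rank_tables` is a hypothetical helper, and the scores shown are made-up values for illustration, not actual model output.

```python
def rank_tables(query_scores, tables, top_k=None):
    """Sort candidate tables by similarity to the query, highest first."""
    if len(query_scores) != len(tables):
        raise ValueError("need exactly one score per table")
    ranked = sorted(zip(tables, query_scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

tables = [
    "CREATE TABLE Farm (FarmID INT, FarmName VARCHAR(50), WaterTemperature DECIMAL, Biomass DECIMAL)",
    "CREATE TABLE Locations (id INT PRIMARY KEY, name VARCHAR(50), region VARCHAR(50), depth INT)",
]

# Hypothetical scores. With the model loaded as above, they could be obtained as:
#   scores = model.similarity(embeddings[:1], embeddings[1:])[0].tolist()
scores = [0.71, 0.32]

for table, score in rank_tables(scores, tables, top_k=1):
    print(f"{score:.2f}  {table}")
```

Passing `top_k` keeps only the best candidates, which is useful when only a few table definitions should be forwarded to the SQL generation step.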

Training Details

Framework Versions

Python: 3.10.14
Sentence Transformers: 3.0.1
Transformers: 4.42.3
PyTorch: 2.3.1+cu121
Accelerate: 0.31.0
Datasets: 2.20.0
Tokenizers: 0.19.1
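
To compare a local environment against the versions listed above, a small check such as the following can help. This is a sketch that assumes the usual PyPI distribution names for each library (for example `torch` for PyTorch); the Python version itself is not checked.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above, keyed by assumed PyPI distribution name.
EXPECTED = {
    "sentence-transformers": "3.0.1",
    "transformers": "4.42.3",
    "torch": "2.3.1+cu121",
    "accelerate": "0.31.0",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, installed in installed_versions(EXPECTED).items():
    status = "ok" if installed == EXPECTED[name] else "differs"
    print(f"{name}: expected {EXPECTED[name]}, installed {installed} ({status})")
```

A differing version does not necessarily mean the model will fail to load; the list records the environment used for training.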

RaduGabriel/BGE-M3-SQL