---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: cc-by-nc-sa-4.0
language:
- krc
---
# TSjB/labse-qm

This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

Fine-tuned by [Bogdan Tewunalany](https://t.me/bogdan_tewunalany).

Based on [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).

## Usage (Sentence-Transformers)

### Python:

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Бу айтым юлгюдю"]

model = SentenceTransformer('TSjB/labse-qm')
embeddings = model.encode(sentences)
print(embeddings)
```
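Since the card mentions semantic search, here is a minimal sketch of ranking corpus sentences against a query by cosine similarity. It uses only NumPy. The random vectors are placeholders standing in for the output of `model.encode(...)`, and the helper names `cosine_similarity_matrix` and `top_k` are hypothetical, not part of sentence-transformers.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def top_k(query_emb, corpus_emb, k=1):
    """Return indices and scores of the k most similar corpus rows per query."""
    sims = cosine_similarity_matrix(query_emb, corpus_emb)
    order = np.argsort(-sims, axis=1)[:, :k]
    return order, np.take_along_axis(sims, order, axis=1)


# Placeholder data (assumption): in practice these arrays would come from
# model.encode(corpus_sentences) and model.encode(query_sentences).
rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(3, 768))
# A query that is a slightly perturbed copy of corpus row 2.
query_embeddings = corpus_embeddings[[2]] + 0.01 * rng.normal(size=(1, 768))

indices, scores = top_k(query_embeddings, corpus_embeddings, k=2)
print(indices)
print(scores)
```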
### R language:

```r
library(data.table)
library(reticulate)
library(ggplot2)
library(ggrepel)
library(Rtsne)
library(magrittr)  # provides the %>% pipe used below

# Install the Python package into the active reticulate environment and import it
py_install("sentence-transformers", pip = TRUE)
st <- import("sentence_transformers")

english_sentences <- c("dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog.")
italian_sentences <- c("cane", "I cuccioli sono carini.", "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.")
qarachay_sentences <- c("ит", "Итле джагъымлыдыла.", "Джагъа юсю бла итим бла айланыргъа сюеме.")

model <- st$SentenceTransformer('TSjB/labse-qm')

# Encode each language separately; every sentence becomes one row
english_embeddings <- model$encode(english_sentences)
italian_embeddings <- model$encode(italian_sentences)
qarachay_embeddings <- model$encode(qarachay_sentences)

# Stack all embeddings into a single matrix
m <- rbind(english_embeddings,
           italian_embeddings,
           qarachay_embeddings) %>% as.matrix

# Project to 2D; perplexity must not exceed (nrow(m) - 1) / 3
tsne <- Rtsne(m, perplexity = floor((nrow(m) - 1) / 3))

# Attach the sentence text and language label to each projected point
tSNE_df <- tsne$Y %>%
  as.data.table() %>%
  setnames(old = c("V1", "V2"), new = c("tSNE1", "tSNE2")) %>%
  .[, `:=`(sentence = c(english_sentences, italian_sentences, qarachay_sentences),
           language = c(rep("english", length(english_sentences)),
                        rep("italian", length(italian_sentences)),
                        rep("qarachay", length(qarachay_sentences))))]

# Plot the projection, colored by language and labeled with the sentence
tSNE_df %>%
  ggplot(aes(x = tSNE1,
             y = tSNE2,
             color = language,
             label = sentence)) +
  geom_label_repel() +
  geom_point()
```

## Evaluation Results

For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=TSjB/labse-qm)

## Training
The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 6439 with parameters:
```
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
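To make the loss settings concrete, here is a NumPy sketch of the idea behind a multiple-negatives ranking loss with `scale = 20.0` and cosine similarity. For a batch of (anchor, positive) pairs, each anchor is scored against every positive in the batch, the matching positive is the target, and the other positives act as in-batch negatives. This is an illustration under those assumptions, not the library's implementation; the function name and the random inputs are hypothetical.

```python
import numpy as np


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities; targets are on the diagonal."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Row i should assign the highest probability to column i.
    return float(-np.mean(np.diag(log_probs)))


# Placeholder batch of 8 pairs (assumption: random vectors instead of real embeddings).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.01 * rng.normal(size=(8, 768))

aligned_loss = multiple_negatives_ranking_loss(anchors, positives)
# Rolling the positives by one row pairs every anchor with the wrong positive.
shuffled_loss = multiple_negatives_ranking_loss(anchors, np.roll(positives, 1, axis=0))
print(aligned_loss, shuffled_loss)
```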
Parameters of the fit()-Method:
```
{
    "epochs": 1,
    "evaluation_steps": 100,
    "evaluator": "__main__.ChainScoreEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "warmupcosine",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
```
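The learning-rate curve implied by these settings can be sketched in plain Python. The sketch assumes that `warmupcosine` follows the common shape of a linear warmup followed by a half-cosine decay to zero, and that the total step count is 6439 batches times 1 epoch. The function name `warmup_cosine_lr` is hypothetical.

```python
import math


def warmup_cosine_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=6439):
    """Learning rate at a given optimizer step (assumed warmup + cosine shape)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 500, 1000, 3000, 6439):
    print(step, warmup_cosine_lr(step))
```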
## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)
```
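The stages after the transformer can be sketched in NumPy: take the CLS token (position 0) of each sequence, apply a 768-to-768 dense layer with a tanh activation, then L2-normalize. The token embeddings and dense weights below are random placeholders rather than the model's trained parameters, and the function name `sentence_embedding` is hypothetical. Because the final step normalizes each vector to unit length, the dot product of two embeddings equals their cosine similarity.

```python
import numpy as np


def sentence_embedding(token_embeddings, dense_weight, dense_bias):
    """CLS pooling -> Dense + Tanh -> L2 normalization, mirroring modules (1)-(3)."""
    cls = token_embeddings[:, 0, :]  # (1) keep only the first token per sequence
    projected = np.tanh(cls @ dense_weight.T + dense_bias)  # (2) dense layer + tanh
    # (3) scale every row to unit length
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)


# Placeholder inputs (assumption): 2 sequences of 16 tokens with 768-dim outputs.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 16, 768))
dense_weight = 0.02 * rng.normal(size=(768, 768))
dense_bias = np.zeros(768)

embeddings = sentence_embedding(token_embeddings, dense_weight, dense_bias)
print(embeddings.shape)
print(np.linalg.norm(embeddings, axis=1))
```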