# SHerbert - Polish SentenceBERT
SentenceBERT is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Training followed the original paper [Siamese BERT models for the task of semantic textual similarity (STS)](https://arxiv.org/abs/1908.10084), with a slight modification of how the training data was used. The goal of the model is to produce embeddings that reflect the semantic and topical similarity of the given texts.
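To make the triplet structure concrete, here is a minimal sketch of a triplet objective as described in the original Sentence-BERT paper (Euclidean distance, margin of 1). It is an illustration under those assumptions, not the training code used for this model; the function names and toy vectors are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# Made-up 2-dimensional vectors standing in for sentence embeddings
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_easy = triplet_loss(anchor, positive, negative)  # 1 - 5 + 1 < 0, clipped to 0
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped: 5 - 1 + 1
```

The loss only pushes on triplets that violate the margin, which is why the first case contributes nothing to training.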

> Semantic textual similarity analyzes how similar two pieces of texts are.

Read more about how the model was prepared in our [blog post](https://voicelab.ai/blog/).

The base model is Polish HerBERT, a BERT-based language model. For more details, please refer to: "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".

# Corpus
The model was trained solely on [Wikipedia](https://dumps.wikimedia.org/).


# Tokenizer

As in the original HerBERT implementation, the training dataset was tokenized into subwords using character-level byte-pair encoding (`CharBPETokenizer`) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the `tokenizers` library.

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
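For readers unfamiliar with byte-pair encoding, the following toy sketch shows a single character-level merge step: count adjacent symbol pairs, then fuse the most frequent one. It is a simplified illustration only, not the `CharBPETokenizer` implementation; the helper names and the tiny corpus are made up.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("mama"): 3, tuple("tama"): 2}

pair = most_frequent_pair(words)        # the pair ("m", "a") occurs most often
merged_words = merge_pair(words, pair)  # "mama" becomes ("ma", "ma")
```

A real tokenizer repeats this step until the vocabulary reaches its target size (50k tokens here).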

# Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("voicelab/sbert-base")
tokenizer = AutoTokenizer.from_pretrained("voicelab/sbert-base")

s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest sktukiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue. "

tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors="pt")

# Inference only: disable gradient tracking and convert to NumPy for scikit-learn
with torch.no_grad():
    x = sbert(tokens["input_ids"],
              tokens["attention_mask"]).pooler_output.numpy()

# cosine_similarity expects 2D inputs, so slice rows as x[0:1] rather than x[0]

# similarity between sentences s0 and s1
print(pairwise.cosine_similarity(x[0:1], x[1:2]))  # Result: 0.7952354

# similarity between sentences s0 and s2
print(pairwise.cosine_similarity(x[0:1], x[2:3]))  # Result: 0.42359722
```
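A common next step is to rank several candidate sentences against a query by cosine similarity. The sketch below does this in plain Python on made-up 2-dimensional vectors; `cosine` and `rank_by_similarity` are hypothetical helpers, and in practice each vector would be one row of the model's `pooler_output` converted to a list.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar to `query`."""
    scores = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy vectors standing in for real sentence embeddings
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = rank_by_similarity(query, candidates)
best_index, best_score = ranking[0]
```

Because cosine similarity ignores vector length, the candidate `[2.0, 0.0]` ranks first even though it is twice as long as the query.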


# License

CC BY 4.0

# Citation

If you use this model, please cite the following paper:


# Authors

The model was trained by the NLP Research Team at Voicelab.ai.

You can contact us [here](https://voicelab.ai/contact/).