---
license: cc-by-4.0
---

# SHerbert large - Polish SentenceBERT

SentenceBERT is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Training was based on the original paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084), with a slight modification of how the training data was used. The goal of the model is to generate different embeddings based on the semantic and topic similarity of the given text.

> Semantic textual similarity analyzes how similar two pieces of text are.

Read more about how the model was prepared in our [blog post](https://voicelab.ai/blog/).

The base model is the Polish HerBERT, a BERT-based language model. For more details, please refer to "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".

# Corpus

The model was trained solely on [Wikipedia](https://dumps.wikimedia.org/).

# Tokenizer

As in the original HerBERT implementation, the training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the `tokenizers` library.
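
For illustration, a rough sketch of how such a tokenizer can be trained with the `tokenizers` library (the corpus file name and exact settings here are assumptions, not the original training script):

```python
from tokenizers import CharBPETokenizer

# Hypothetical corpus file; the actual tokenizer was trained on a Wikipedia dump.
corpus_files = ["wiki_pl.txt"]

# Character-level BPE with a 50k-token vocabulary, as described above.
tokenizer = CharBPETokenizer()
tokenizer.train(files=corpus_files, vocab_size=50000)
tokenizer.save("char_bpe_50k.json")
```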

We kindly encourage you to use the fast version of the tokenizer, namely `HerbertTokenizerFast`.
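
For example, the fast tokenizer can be loaded directly (a minimal sketch, assuming the tokenizer files ship with the checkpoint):

```python
from transformers import HerbertTokenizerFast

fast_tokenizer = HerbertTokenizerFast.from_pretrained("Voicelab/sherbert-large-cased")
print(fast_tokenizer.tokenize("Uczenie maszynowe"))  # subword tokens from the CharBPE vocabulary
```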

# Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("Voicelab/sherbert-large-cased")
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sherbert-large-cased")

s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest sktukiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue. "

tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

# Sentence embeddings taken from the pooler output; no gradients are needed for inference.
with torch.no_grad():
    x = sbert(tokens["input_ids"],
              tokens["attention_mask"]).pooler_output

# similarity between sentences s0 and s1
print(pairwise.cosine_similarity(x[0:1], x[1:2]))  # Result: 0.7952354

# similarity between sentences s0 and s2
print(pairwise.cosine_similarity(x[0:1], x[2:3]))  # Result: 0.42359722
```
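
Continuing from the snippet above (not part of the original example), the full matrix of pairwise similarities can be obtained in a single call:

```python
# 3x3 matrix of cosine similarities between s0, s1 and s2, reusing `x` from the example above.
print(pairwise.cosine_similarity(x))
```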

# Results

| Model                    | Accuracy   | Source                                                    |
|--------------------------|------------|-----------------------------------------------------------|
| SBERT-WikiSec-base (EN)  | 80.42%     | https://arxiv.org/abs/1908.10084                          |
| SBERT-WikiSec-large (EN) | 80.78%     | https://arxiv.org/abs/1908.10084                          |
| SHerbert-base (PL)       | 82.31%     | https://huggingface.co/Voicelab/sherbert-base-cased       |
| **SHerbert-large (PL)**  | **84.42%** | **https://huggingface.co/Voicelab/sherbert-large-cased**  |

# License

CC BY 4.0

# Citation

If you use this model, please cite the following paper:

# Authors

The model was trained by the NLP Research Team at Voicelab.ai.

You can contact us [here](https://voicelab.ai/contact/).