piotr-rybak committed
Commit fa15921
1 Parent(s): 249e047

init commit

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
{
    "word_embedding_dimension": 768,
    "pooling_mode_cls_token": true,
    "pooling_mode_mean_tokens": false,
    "pooling_mode_max_tokens": false,
    "pooling_mode_mean_sqrt_len_tokens": false
}
README.md ADDED
@@ -0,0 +1,99 @@
---
pipeline_tag: sentence-similarity
language:
- pl
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- ipipan/polqa
- ipipan/maupqa
---

# HerBERT-base Retrieval (v2)

The HerBERT Retrieval model encodes Polish sentences and paragraphs into a 768-dimensional dense vector space and can be used for tasks like document retrieval or semantic search.

It was initialized from the [HerBERT-base](https://huggingface.co/allegro/herbert-base-cased) model and fine-tuned on the [PolQA](https://huggingface.co/ipipan/polqa) and [MAUPQA](https://huggingface.co/ipipan/maupqa) datasets for 40,000 steps with a batch size of 256.

The model was trained on question-passage pairs and works best on similar tasks. Each training passage consisted of its `title` and `text` concatenated with the special token `</s>`. Even if your passages don't have a `title`, it is still beneficial to prefix the passage `text` with the `</s>` token, as in the sketch below.
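
For illustration, one way to build such passages (`format_passage` is a hypothetical helper, not part of this repository; the exact preprocessing used during training is not shown on this card):

```python
def format_passage(text: str, title: str = "") -> str:
    # Concatenate title and text with the `</s>` separator described above.
    # When there is no title, the passage text is still prefixed with `</s>`.
    return f"{title}</s>{text}"

print(format_passage("Zbigniew Bolesław Ryszard Herbert…", title="Zbigniew Herbert"))
print(format_passage("Zbigniew Bolesław Ryszard Herbert…"))
```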

## Usage (Sentence-Transformers)

Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = [
    "W jakim mieście urodził się Zbigniew Herbert?",
    "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
]

model = SentenceTransformer('ipipan/herbert-base-retrieval-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
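
To rank passages for a query, compare the query embedding against the passage embeddings. A minimal retrieval sketch follows; the dot-product scoring and the second (distractor) passage are assumptions for illustration, since the card does not state which similarity function the model was trained with:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('ipipan/herbert-base-retrieval-v2')

query = "W jakim mieście urodził się Zbigniew Herbert?"
passages = [
    "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
    "Adam Mickiewicz</s>Adam Bernard Mickiewicz (ur. 24 grudnia 1798 w Zaosiu) – polski poeta.",  # hypothetical distractor
]

query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)

# Dot-product scoring is an assumption; util.cos_sim is a common alternative.
scores = util.dot_score(query_emb, passage_embs)
print(passages[scores.argmax().item()])
```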

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # CLS pooling: take the hidden state of the first ([CLS]) token.
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = [
    "W jakim mieście urodził się Zbigniew Herbert?",
    "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('ipipan/herbert-base-retrieval-v2')
model = AutoModel.from_pretrained('ipipan/herbert-base-retrieval-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
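
Continuing the snippet above, the embeddings can be scored directly in PyTorch; as before, dot-product scoring is an assumption rather than something stated on this card:

```python
# Dot product between the question (row 0) and the passage (row 1).
score = sentence_embeddings[0] @ sentence_embeddings[1]
print(f"question-passage score: {score.item():.4f}")
```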

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Additional Information

### Model Curators

The model was created by Piotr Rybak from the [Institute of Computer Science, Polish Academy of Sciences](http://zil.ipipan.waw.pl/).

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]
config.json ADDED
@@ -0,0 +1,32 @@
{
  "_name_or_path": "../../../sellaservice/rt/model_all_filter5p_bs256_lr2e5_40k/",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "HerbertTokenizerFast",
  "torch_dtype": "float32",
  "transformers_version": "4.30.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50000
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
{
  "__version__": {
    "sentence_transformers": "2.2.2",
    "transformers": "4.30.1",
    "pytorch": "2.0.1"
  }
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modules.json ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ef7f20de52fd7d1a0c2d5efbbc473cc16034278c168d073b00e5a63c5d1ddf26
size 497839917
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED
@@ -0,0 +1,8 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "mask_token": "<mask>",
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
{
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "do_lowercase_and_remove_accent": false,
  "id2lang": null,
  "lang2id": null,
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "HerbertTokenizer",
  "unk_token": "<unk>"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff