youval committed on
Commit
5fb98c8
1 Parent(s): a774abe

First README.md draft (#1)

- Initial commit (1f4ddd44d3377a6256af8bfde021f34918f23c68)

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 384,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
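The pooling configuration above enables only `pooling_mode_mean_tokens`, so token vectors are averaged into a single sentence vector. As a minimal sketch of that idea (not the library's implementation; the function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions), mean pooling over non-padding tokens could look like this:

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of the tokens whose mask entry is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [total / count for total in summed]


# Toy example: 2-dimensional vectors (the real model uses 384 dimensions).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]  # the last token is padding and must be ignored
pooled = mean_pool(tokens, mask)
```

The padding token is excluded from both the sum and the divisor, which is what distinguishes masked mean pooling from a plain average over all positions.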
README.md ADDED
@@ -0,0 +1,161 @@
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - feature-extraction
+ - sentence-similarity
+ language:
+ - de
+ - en
+ - es
+ - fr
+ - it
+ - nl
+ - ja
+ - pt
+ - zh
+ - pl
+ ---
+
+ # Model Card for `vectorizer.hazelnut`
+
+ This model is a vectorizer developed by Sinequa. It produces an embedding vector given a passage or a query. The
+ passage vectors are stored in our vector index and the query vector is used at query time to look up relevant passages
+ in the index.
+
+ Model name: `vectorizer.hazelnut`
+
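To illustrate the lookup described in the README introduction, here is a minimal sketch of ranking stored passage vectors against a query vector. The use of cosine similarity, the brute-force scan, and the names `cosine` and `top_k` are assumptions made for illustration; the README does not say how the Sinequa vector index scores or searches vectors.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored passage against the query and keep the k best."""
    scored = [(passage_id, cosine(query_vec, vec)) for passage_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index with 2-dimensional vectors.
index = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
results = top_k([1.0, 0.1], index, k=2)
```

A production index would typically replace the linear scan with an approximate nearest-neighbour structure, but the ranking principle is the same.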
+ ## Supported Languages
+
+ The model was trained and tested in the following languages:
+
+ - English
+ - French
+ - German
+ - Spanish
+ - Italian
+ - Dutch
+ - Japanese
+ - Portuguese
+ - Chinese (simplified)
+ - Polish
+
+ Besides these languages, basic support can be expected for an additional 91 languages that were used during the
+ pretraining of the base model (see Appendix A of the XLM-R paper).
+
+ ## Scores
+
+ | Metric                         | Value |
+ |:-------------------------------|------:|
+ | English Relevance (Recall@100) | 0.590 |
+ | Polish Relevance (Recall@100)  | 0.534 |
+
+ Note that the relevance scores are computed as an average over several retrieval datasets (see
+ [details below](#evaluation-metrics)).
+
+ ## Inference Times
+
+ | GPU        | Quantization type | Batch size 1 | Batch size 32 |
+ |:-----------|:------------------|-------------:|--------------:|
+ | NVIDIA A10 | FP16              |         1 ms |          5 ms |
+ | NVIDIA A10 | FP32              |         2 ms |         18 ms |
+ | NVIDIA T4  | FP16              |         1 ms |         12 ms |
+ | NVIDIA T4  | FP32              |         3 ms |         52 ms |
+ | NVIDIA L4  | FP16              |         2 ms |          5 ms |
+ | NVIDIA L4  | FP32              |         4 ms |         24 ms |
+
+ ## GPU Memory Usage
+
+ | Quantization type | Memory   |
+ |:------------------|---------:|
+ | FP16              |  550 MiB |
+ | FP32              | 1050 MiB |
+
+ Note that GPU memory usage only includes how much GPU memory the actual model consumes on an NVIDIA T4 GPU with a batch
+ size of 32. It does not include the fixed amount of memory consumed by the ONNX Runtime upon initialization, which
+ can be around 0.5 to 1 GiB depending on the GPU used.
+
+ ## Requirements
+
+ - Minimum Sinequa version: 11.10.0
+ - Minimum Sinequa version for using FP16 models and GPUs with CUDA compute capability of 8.9+ (like NVIDIA L4): 11.11.0
+ - [CUDA compute capability](https://developer.nvidia.com/cuda-gpus): above 5.0 (above 6.0 for FP16 use)
+
+ ## Model Details
+
+ ### Overview
+
+ - Number of parameters: 107 million
+ - Base language model: [mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) ([Paper](https://arxiv.org/abs/2012.15828), [GitHub](https://github.com/microsoft/unilm/tree/master/minilm))
+ - Insensitive to casing and accents
+ - Output dimensions: 256 (reduced with an additional dense layer)
+ - Training procedure: query-passage-negative triplets for datasets that have mined hard negatives, query-passage
+   pairs for the rest. The number of negatives is augmented with an in-batch negatives strategy.
+
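The training procedure in the Overview mentions in-batch negatives: for each query, the passages paired with the other queries in the same batch are reused as negatives. The README does not name the loss function, so the following is only a sketch under the assumption of a softmax cross-entropy over dot-product scores; the function name and the `scale` parameter are hypothetical.

```python
import math
from typing import List


def in_batch_negatives_loss(query_vecs: List[List[float]],
                            passage_vecs: List[List[float]],
                            scale: float = 1.0) -> float:
    """Mean cross-entropy where passage i is the positive for query i.

    Every other passage in the batch acts as a negative. Mined hard
    negatives can be appended to passage_vecs after the positives.
    """
    losses = []
    for i, query in enumerate(query_vecs):
        scores = [scale * sum(a * b for a, b in zip(query, passage))
                  for passage in passage_vecs]
        # Numerically stable log-sum-exp over all candidate passages.
        top = max(scores)
        log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_sum - scores[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when each query scores highest against its own passage (`aligned`) than when the pairs are mismatched (`shuffled`), which is the signal that pushes matching pairs together during training.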
+ ### Training Data
+
+ The model has been trained using all datasets that are cited in
+ the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.
+ In addition, it has been trained on the datasets cited
+ in [this paper](https://arxiv.org/pdf/2108.13897.pdf) for the first 9 languages listed above.
+ It has also been trained on [this dataset](https://huggingface.co/datasets/clarin-knext/msmarco-pl) for Polish capabilities.
+
+ ### Evaluation Metrics
+
+ #### English
+
+ To determine the relevance score, we averaged the results that we obtained when evaluating on the datasets of the
+ [BEIR benchmark](https://github.com/beir-cellar/beir). Note that all these datasets are in **English**.
+
+ | Dataset           | Recall@100 |
+ |:------------------|-----------:|
+ | Average           |      0.590 |
+ |                   |            |
+ | Arguana           |      0.961 |
+ | CLIMATE-FEVER     |      0.432 |
+ | DBPedia Entity    |      0.371 |
+ | FEVER             |      0.723 |
+ | FiQA-2018         |      0.611 |
+ | HotpotQA          |      0.564 |
+ | MS MARCO          |      0.825 |
+ | NFCorpus          |      0.266 |
+ | NQ                |      0.722 |
+ | Quora             |      0.991 |
+ | SCIDOCS           |      0.426 |
+ | SciFact           |      0.864 |
+ | TREC-COVID        |      0.092 |
+ | Webis-Touche-2020 |      0.415 |
+
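As a sketch of how the figures in the English table fit together, the snippet below defines one common form of Recall@k (the fraction of a query's relevant passages found in the top k results) and recomputes the reported average from the per-dataset values. The `recall_at_k` helper is an illustrative assumption; the README does not specify the exact evaluation code, and benchmark toolkits may differ in details.

```python
from typing import Iterable, List


def recall_at_k(ranked_ids: List[str], relevant_ids: Iterable[str], k: int = 100) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


# Per-dataset Recall@100 values copied from the English table.
english = {
    "Arguana": 0.961, "CLIMATE-FEVER": 0.432, "DBPedia Entity": 0.371,
    "FEVER": 0.723, "FiQA-2018": 0.611, "HotpotQA": 0.564,
    "MS MARCO": 0.825, "NFCorpus": 0.266, "NQ": 0.722, "Quora": 0.991,
    "SCIDOCS": 0.426, "SciFact": 0.864, "TREC-COVID": 0.092,
    "Webis-Touche-2020": 0.415,
}
average = sum(english.values()) / len(english)
example = recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)
```

The unweighted mean of the 14 datasets rounds to the 0.590 shown in the "Average" row, so each dataset contributes equally regardless of its size.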
+ #### Polish
+
+ The model's Polish capabilities are evaluated on a subset of the [PIRBenchmark](https://github.com/sdadas/pirb).
+
+ | Dataset           | Recall@100 |
+ |:------------------|-----------:|
+ | Average           |      0.534 |
+ |                   |            |
+ | arguana-pl        |      0.909 |
+ | dbpedia-pl        |      0.282 |
+ | fiqa-pl           |      0.439 |
+ | hotpotqa-pl       |      0.530 |
+ | msmarco-pl        |      0.694 |
+ | nfcorpus-pl       |      0.218 |
+ | nq-pl             |      0.697 |
+ | quora-pl          |      0.949 |
+ | scidocs-pl        |      0.291 |
+ | scifact-pl        |      0.805 |
+ | trec-covid-pl     |      0.059 |
+
+ #### Other languages
+
+ We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its
+ multilingual capabilities. Note that not all training languages are part of the benchmark, so we only report the
+ metrics for the languages it covers.
+
+ | Language              | Recall@100 |
+ |:----------------------|-----------:|
+ | French                |      0.649 |
+ | German                |      0.598 |
+ | Spanish               |      0.609 |
+ | Japanese              |      0.623 |
+ | Chinese (simplified)  |      0.707 |
@@ -0,0 +1,26 @@
+ {
+   "_name_or_path": "nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large",
+   "architectures": [
+     "XLMRobertaForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.25.1",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
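The last module listed in `modules.json` is `sentence_transformers.models.Normalize`, which scales each output vector to unit L2 length. The sketch below is a plain-Python illustration of that operation, not the library code; the helper name `l2_normalize` and the toy vectors are assumptions. It also shows why normalization matters for retrieval: once vectors have unit length, their dot product equals their cosine similarity.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
# With unit-length inputs, the dot product is the cosine similarity.
dot = sum(x * y for x, y in zip(a, b))
```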
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c6d1059af50788d7e1cf263a8cf1b553bc55716ae9b89afddabd6abf7cd5dd5b
+ size 428012973
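The pointer above stores only the hash and byte size of the real weights file. As a rough cross-check, and assuming (the commit does not state this) that the weights are stored as 32-bit floats, dividing the size by 4 bytes should land near the 107 million parameters quoted in the README. The helper `parse_lfs_pointer` is a hypothetical name used for this sketch.

```python
from typing import Dict

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:c6d1059af50788d7e1cf263a8cf1b553bc55716ae9b89afddabd6abf7cd5dd5b
size 428012973
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(pointer_text)
size_bytes = int(pointer["size"])
# Assumption: 4 bytes per parameter (FP32). The file also holds some
# non-weight overhead, so this is an estimate, not an exact count.
approx_params_millions = size_bytes / 4 / 1e6
```

The estimate comes out at roughly 107 million, consistent with the parameter count in the model card.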
reduction_layer.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6bf0496af06818c85b6d268c84aaec7913eaeb665d71f5451b50c9e9c5758b4a
+ size 395271
sinequa.metadata.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "score-scaling-factor": 3.0
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4011b5810b74e5b6348c7d6458b9dda20b5af6b759dc999f113c31888c6b6eb
+ size 17083132