michaelfeil committed on
Commit 18ed68f
1 Parent(s): 42dc317

Upload jinaai/jina-embedding-t-en-v1 ctranslate2 weights

README.md ADDED
---
pipeline_tag: sentence-similarity
tags:
- ctranslate2
- int8
- float16
- finetuner
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- jinaai/negation-dataset
language: en
license: apache-2.0
---
# Fast Inference with CTranslate2
Speed up inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.

Quantized version of [jinaai/jina-embedding-t-en-v1](https://huggingface.co/jinaai/jina-embedding-t-en-v1).
```bash
pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1"
```

```python
# from transformers import AutoTokenizer
model_name = "michaelfeil/ct2fast-jina-embedding-t-en-v1"
model_name_orig = "jinaai/jina-embedding-t-en-v1"

from hf_hub_ctranslate2 import EncoderCT2fromHfHub
model = EncoderCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    max_length=64,
)  # perform downstream tasks on outputs
outputs["pooler_output"]
outputs["last_hidden_state"]
outputs["attention_mask"]

# Alternative: use the SentenceTransformer mix-in
# for end-to-end sentence-embedding generation
# (does not pull from this CT2fast-HF repo).
from hf_hub_ctranslate2 import CT2SentenceTransformer
model = CT2SentenceTransformer(
    model_name_orig, compute_type="int8_float16", device="cuda"
)
embeddings = model.encode(
    ["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    batch_size=32,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
print(embeddings.shape, embeddings)
scores = (embeddings @ embeddings.T) * 100

# Hint: you can also host this code as a REST API via
# github.com/michaelfeil/infinity
```

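If you work with the raw `generate` outputs directly, sentence embeddings are typically derived by mean pooling `last_hidden_state` over the non-padded tokens. A minimal sketch, assuming the returned outputs are NumPy-compatible (adapt the conversion if your version returns another type):

```python
import numpy as np

# mean pooling over real tokens only (illustrative; assumes NumPy-compatible outputs)
hidden = np.asarray(outputs["last_hidden_state"])        # (batch, seq_len, hidden)
mask = np.asarray(outputs["attention_mask"])[..., None]  # (batch, seq_len, 1)

summed = (hidden * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)           # avoid division by zero
sentence_embeddings = summed / counts

# L2-normalize so that dot products equal cosine similarities
sentence_embeddings /= np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
```
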
Checkpoint compatible with [ctranslate2>=3.17.1](https://github.com/OpenNMT/CTranslate2)
and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"` (see the CPU sketch below)

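The same loader also runs without a GPU; a minimal CPU sketch using the `int8` option listed above:

```python
from hf_hub_ctranslate2 import EncoderCT2fromHfHub

# CPU-only variant of the loader above (int8 weights, no CUDA required)
cpu_model = EncoderCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-jina-embedding-t-en-v1",
    device="cpu",
    compute_type="int8",
)
cpu_outputs = cpu_model.generate(
    text=["A quick smoke test on CPU"],
    max_length=64,
)
```
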
Converted on 2023-10-13.

# Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.

# Original description


<br><br>

<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>, <a href="https://github.com/jina-ai/finetuner"><b>Finetuner</b></a> team.</b>
</p>


## Intended Usage & Model Info

`jina-embedding-t-en-v1` is a tiny language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.

The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.

With just 14 million parameters,
the model enables lightning-fast inference on CPU, while still delivering impressive performance.
Additionally, we provide the following options:

- [`jina-embedding-t-en-v1`](https://huggingface.co/jinaai/jina-embedding-t-en-v1): 14 million parameters **(you are here)**.
- [`jina-embedding-s-en-v1`](https://huggingface.co/jinaai/jina-embedding-s-en-v1): 35 million parameters.
- [`jina-embedding-b-en-v1`](https://huggingface.co/jinaai/jina-embedding-b-en-v1): 110 million parameters.
- [`jina-embedding-l-en-v1`](https://huggingface.co/jinaai/jina-embedding-l-en-v1): 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10 times bert-base (soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30 times bert-base (soon).

## Data & Parameters

Please check out our [technical report](https://arxiv.org/abs/2307.11224).

## Metrics

We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from SBERT and `text-embedding-ada-002` from OpenAI:

| Name | Params | Dimension |
|------------------------------|-----|------|
| all-minilm-l6-v2 | 23M | 384 |
| all-mpnet-base-v2 | 110M | 768 |
| text-embedding-ada-002 | Unknown (OpenAI API) | 1536 |
| jina-embedding-t-en-v1 | 14M | 312 |
| jina-embedding-s-en-v1 | 35M | 512 |
| jina-embedding-b-en-v1 | 110M | 768 |
| jina-embedding-l-en-v1 | 330M | 1024 |


| Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
| all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
| all-mpnet-base-v2 | 0.726 | **0.835** | 0.78 | 0.857 | 0.8 | **0.906** | 0.513 | 0.875 | 0.656 |
| text-embedding-ada-002 | 0.698 | 0.833 | 0.761 | 0.861 | **0.86** | 0.903 | **0.685** | 0.876 | **0.726** |
| jina-embedding-t-en-v1 | 0.717 | 0.773 | 0.731 | 0.829 | 0.777 | 0.860 | 0.482 | 0.840 | 0.522 |
| jina-embedding-s-en-v1 | 0.743 | 0.786 | 0.738 | 0.837 | 0.80 | 0.875 | 0.523 | 0.857 | 0.524 |
| jina-embedding-b-en-v1 | **0.751** | 0.809 | 0.761 | 0.856 | 0.812 | 0.890 | 0.606 | 0.876 | 0.594 |
| jina-embedding-l-en-v1 | 0.745 | 0.832 | **0.781** | **0.869** | 0.837 | 0.902 | 0.573 | **0.881** | 0.598 |

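The exact evaluation setup is not included in this card; as one possible way to reproduce STS-style numbers, the [MTEB](https://github.com/embeddings-benchmark/mteb) harness can be used. A rough sketch, assuming `pip install mteb sentence-transformers` and that the listed task names are available in your MTEB version:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# illustrative only: task selection and settings may differ from the table above
model = SentenceTransformer("jinaai/jina-embedding-t-en-v1")
evaluation = MTEB(tasks=["STS12", "STS13", "STS14", "STS15", "STS16", "SciFact"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```
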
## Inference Speed

We encoded a single sentence, "What is the current weather like today?", 10k times on:

1. CPU: MacBook Pro 2020, 2 GHz Quad-Core Intel Core i5
2. GPU: 1x Nvidia 3090

and recorded the time spent, to demonstrate embedding speed (an illustrative reproduction sketch follows the table):

| Name | Params | Dimension | Time @ CPU | Time @ GPU |
|------------------------------|-----|------|-----|-----|
| jina-embedding-t-en-v1 | 14M | 312 | 5.78s | 2.36s |
| all-minilm-l6-v2 | 23M | 384 | 11.95s | 2.70s |
| jina-embedding-s-en-v1 | 35M | 512 | 17.25s | 2.81s |

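The benchmark script itself is not part of this card; a simple illustrative way to reproduce this kind of measurement with sentence-transformers (absolute timings will differ with hardware and library versions):

```python
import time
from sentence_transformers import SentenceTransformer

# encode one sentence 10k times and report wall-clock time (illustrative sketch)
model = SentenceTransformer("jinaai/jina-embedding-t-en-v1", device="cpu")
sentence = "What is the current weather like today?"

model.encode(sentence)  # warm-up

start = time.perf_counter()
for _ in range(10_000):
    model.encode(sentence)
print(f"10k single-sentence encodes: {time.perf_counter() - start:.2f}s")
```
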
## Usage

Use with Jina AI Finetuner:

```python
!pip install finetuner
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-t-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```

Use with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['how is the weather today', 'What is the current weather like today?']

model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```

## Fine-tuning

Please consider [Finetuner](https://github.com/jina-ai/finetuner).

## Plans

1. The development of `jina-embedding-s-en-v2` is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
2. We are currently working on a bilingual embedding model that combines English and another language. The upcoming model will be called `jina-embedding-s/b/l-de-v1`.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

```latex
@misc{günther2023jina,
      title={Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models},
      author={Michael Günther and Louis Milliken and Jonathan Geuter and Georgios Mastrapas and Bo Wang and Han Xiao},
      year={2023},
      eprint={2307.11224},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
config.json ADDED
{
  "_name_or_path": "tmp/",
  "attention_probs_dropout_prob": 0.1,
  "cell": {},
  "model_type": "bert",
  "emb_size": 312,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pre_trained": "",
  "structure": [],
  "type_vocab_size": 2,
  "vocab_size": 30522,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": 1e-12,
  "unk_token": "[UNK]"
}
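As a sanity check on the "14 million parameters" figure in the model card, the hyperparameters above allow a rough back-of-the-envelope estimate, assuming the standard BERT layout (embeddings, 4 encoder layers, pooler):

```python
# rough parameter-count estimate from the config above; illustrative only
V, P, T = 30522, 512, 2   # vocab size, max positions, token-type vocab
H, I, L = 312, 1200, 4    # hidden size, intermediate size, number of layers

embeddings = (V + P + T) * H + 2 * H          # token/position/type embeddings + LayerNorm
per_layer = (
    4 * (H * H + H)                           # Q, K, V and attention-output projections
    + (H * I + I) + (I * H + H)               # feed-forward up/down projections
    + 2 * 2 * H                               # two LayerNorms
)
pooler = H * H + H
total = embeddings + L * per_layer + pooler
print(f"~{total / 1e6:.1f}M parameters")      # ~14.4M, consistent with the model card
```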
model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:dffc2308cb6291951ed6025ca2e9234d1011913a5496294d7dc545ec999ba824
size 28703620
modules.json ADDED
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json ADDED
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff
 
vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
vocabulary.txt ADDED
The diff for this file is too large to render. See raw diff