---
pipeline_tag: sentence-similarity
tags:
- finetuner
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- jinaai/negation-dataset
language: en
license: apache-2.0
---

<br><br>

<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>

<p align="center">
<b>The text embedding set trained by Jina AI, Finetuner team.</b>
</p>

## Intended Usage & Model Info

`jina-embedding-t-en-v1` is a language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.

The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more; a retrieval sketch follows the model list below.

With a compact size of just 14 million parameters,
the model enables lightning-fast inference while still delivering impressive performance.
Additionally, we provide the following options:

- `jina-embedding-t-en-v1`: 14 million parameters **(you are here)**.
- `jina-embedding-s-en-v1`: 35 million parameters.
- `jina-embedding-b-en-v1`: 110 million parameters.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10x BERT-base size (soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30x BERT-base size (soon).
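
For instance, here is a minimal retrieval sketch using the sentence-transformers loading path shown under Usage below; the three-document corpus is a hypothetical illustration:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

# hypothetical mini corpus, purely for illustration
corpus = [
    'The weather is lovely and sunny today.',
    'Heavy rain is expected tomorrow afternoon.',
    'The stock market closed higher on Friday.',
]
query = 'how is the weather today'

model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode([query], convert_to_tensor=True)

# rank the corpus by cosine similarity to the query
hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))
```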

## Data & Parameters

More details will be released together with the technical report.

## Metrics

We compared the model against `all-minilm-l6-v2` and `all-mpnet-base-v2` from sentence-transformers (SBERT), and `text-embedding-ada-002` from OpenAI:

|Name|Parameters|Output dimension|
|------------------------------|-----|------|
|all-minilm-l6-v2|33M|384|
|all-mpnet-base-v2|110M|768|
|text-embedding-ada-002|Unknown (OpenAI API)|1536|
|jina-embedding-t-en-v1|14M|312|
|jina-embedding-s-en-v1|35M|512|
|jina-embedding-b-en-v1|110M|768|
|jina-embedding-l-en-v1|330M|1024|


|Name|STS12|STS13|STS14|STS15|STS16|STS17|TREC-COVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.790|0.876|0.473|0.876|0.645|
|all-mpnet-base-v2|0.726|0.835|**0.780**|0.857|0.800|**0.906**|0.513|0.875|0.656|
|text-embedding-ada-002|0.698|0.833|0.761|0.861|**0.860**|0.903|**0.685**|0.876|**0.726**|
|jina-embedding-t-en-v1|0.714|0.775|0.723|0.825|0.771|0.863|0.479|0.841|0.542|
|jina-embedding-s-en-v1|**0.743**|0.786|0.738|0.837|0.800|0.875|0.523|0.857|0.524|
|jina-embedding-b-en-v1|0.735|0.792|0.752|0.851|0.801|0.890|0.546|0.871|0.586|
|jina-embedding-l-en-v1|0.739|**0.844**|0.778|**0.863**|0.821|0.896|0.566|**0.882**|0.608|
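
The card does not specify the evaluation harness. One plausible way to reproduce numbers of this kind is the MTEB benchmark suite; the sketch below assumes the `mteb` package and that its task names map onto the columns above:

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')

# evaluate on a subset of the benchmarks from the tables above
evaluation = MTEB(tasks=['STS12', 'SciFact'])
evaluation.run(model, output_folder='results/jina-embedding-t-en-v1')
```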

## Usage

Use with Jina AI Finetuner:

```python
# pip install finetuner
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-t-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```

Use directly with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['how is the weather today', 'What is the current weather like today?']

model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```

## Fine-tuning

Please consider using [Finetuner](https://github.com/jina-ai/finetuner) to fine-tune this model on your own data. A sentence-transformers alternative is sketched below.
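
Since the checkpoint also loads as a `SentenceTransformer` (see Usage above), a minimal contrastive fine-tuning sketch with sentence-transformers is possible as well; the training pairs here are hypothetical placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')

# hypothetical positive pairs; substitute your own in-domain data
train_examples = [
    InputExample(texts=['how is the weather today', 'What is the current weather like today?']),
    InputExample(texts=['where can I buy a bike', 'best shops to purchase a bicycle']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# in-batch negatives suit query-document style pairs
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```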

## Plans

1. The development of `jina-embedding-s-en-v2` is currently underway, with two main objectives: improving performance and increasing the maximum sequence length.
2. We are also working on bilingual embedding models that combine English with another language, e.g. German; the upcoming models will be called `jina-embedding-s/b/l-de-v1`.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.