---
pipeline_tag: sentence-similarity
tags:
- finetuner
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- jinaai/negation-dataset
language: en
license: apache-2.0
---
<br><br>
<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>
<p align="center">
<b>The text embedding set trained by Jina AI, Finetuner team.</b>
</p>
## Intended Usage & Model Info
`jina-embedding-t-en-v1` is a language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a compact size of just 14 million parameters,
the model enables lightning-fast inference while still delivering impressive performance.
Additionally, we provide the following options:
- `jina-embedding-t-en-v1`: 14 million parameters **(you are here)**.
- `jina-embedding-s-en-v1`: 35 million parameters.
- `jina-embedding-b-en-v1`: 110 million parameters.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10x bert-base size (soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30x bert-base size (soon).
## Data & Parameters
More information will be released together with the technical report.
## Metrics
We compared the model against `all-minilm-l6-v2` and `all-mpnet-base-v2` from sbert and `text-embedding-ada-002` from OpenAI:
|Name|param |dimension|
|------------------------------|-----|------|
|all-minilm-l6-v2|23m |384|
|all-mpnet-base-v2 |110m |768|
|ada-embedding-002|Unknown/OpenAI API |1536|
|jina-embedding-t-en-v1|14m |312|
|jina-embedding-s-en-v1|35m |512|
|jina-embedding-b-en-v1|110m |768|
|jina-embedding-l-en-v1|330m |1024|
|Name|STS12|STS13|STS14|STS15|STS16|STS17|TRECOVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.79 |0.876|0.473 |0.876|0.645 |
|all-mpnet-base-v2|0.726|0.835|**0.78** |0.857|0.8 |**0.906**|0.513 |0.875|0.656 |
|ada-embedding-002|0.698|0.833|0.761|0.861|**0.86** |0.903|**0.685** |0.876|**0.726** |
|jina-embedding-t-en-v1|0.714|0.775|0.723|0.825|0.771|0.863|0.479 |0.841|0.542 |
|jina-embedding-s-en-v1|**0.743**|0.786|0.738|0.837|0.80|0.875|0.523 |0.857|0.524 |
|jina-embedding-b-en-v1|0.735|0.792|0.752|0.851|0.801|0.89|0.546 |0.871|0.586 |
|jina-embedding-l-en-v1|0.739|**0.844**|0.778|**0.863**|0.821|0.896|0.566 |**0.882**|0.608 |
## Inference Speed
We encoded the single sentence "What is the current weather like today?" 10k times on:
1. CPU: MacBook Pro 2020, 2 GHz Quad-Core Intel Core i5
2. GPU: 1 Nvidia 3090

We recorded the total time spent to demonstrate the embedding speed:
|Name|param |dimension| time@cpu | time@gpu |
|------------------------------|-----|------|-----|-----|
|jina-embedding-t-en-v1|14m |312| 5.78s | 2.36s|
|all-minilm-l6-v2|23m |384| 11.95s | 2.70s |
|jina-embedding-s-en-v1|35m |512| 17.25s | 2.81s |
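
A measurement of this kind can be reproduced with a small timing loop. The sketch below is an illustration only, not the script used for the table: it assumes one `encode` call per run with a single sentence (the batching used for the numbers above is not stated), and `time_encoding` and `dummy_encode` are hypothetical names introduced here. Replace `dummy_encode` with a real encoder, for example the `encode` method of a loaded `SentenceTransformer`, to time an actual model.

```python
import time
from typing import Callable, List, Sequence


def time_encoding(
    encode: Callable[[List[str]], Sequence],
    sentence: str,
    n_runs: int = 10_000,
) -> float:
    """Return wall-clock seconds spent encoding `sentence` n_runs times.

    Assumption: one encode call per run, each with a single sentence.
    """
    start = time.perf_counter()
    for _ in range(n_runs):
        encode([sentence])
    return time.perf_counter() - start


def dummy_encode(sentences: List[str]) -> List[List[float]]:
    """Hypothetical stand-in encoder so the sketch runs without a model."""
    return [[float(len(s))] for s in sentences]


if __name__ == "__main__":
    elapsed = time_encoding(
        dummy_encode, "What is the current weather like today?", n_runs=1_000
    )
    print(f"{elapsed:.4f}s for 1,000 runs")
```

Timings depend heavily on hardware, library versions, and warm-up, so results from this loop are only comparable when all models are measured on the same machine.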
## Usage
Use with Jina AI Finetuner:
```python
# Install first with `pip install finetuner` (in a notebook: `!pip install finetuner`).
import finetuner

# Load the model and encode two sentences.
model = finetuner.build_model('jinaai/jina-embedding-t-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```
Use directly with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['how is the weather today', 'What is the current weather like today?']
model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```
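
To make the similarity score concrete, here is a minimal plain-Python sketch of cosine similarity and of ranking documents against a query by that score. It is an illustration rather than the library implementation: `cosine_similarity` and `rank_by_similarity` are hypothetical helpers introduced here, the toy vectors stand in for real embeddings, and returning `0.0` for a zero vector is a convention chosen for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], docs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, d)) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    query = [1.0, 0.0, 1.0]
    docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
    for idx, score in rank_by_similarity(query, docs):
        print(idx, round(score, 3))
```

With real embeddings, the same ranking step is the core of the information retrieval and reranking use cases mentioned above: encode the query and the candidate documents, then sort the candidates by their similarity to the query.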
## Fine-tuning
To fine-tune the model on your own data, please consider [Finetuner](https://github.com/jina-ai/finetuner).
## Plans
1. The development of `jina-embedding-s-en-v2` is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
2. We are currently working on a bilingual embedding model that combines English with a second language X. The upcoming model will be called `jina-embedding-s/b/l-de-v1`.
## Contact
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.