alaeddine-13 committed on
Commit 39b8f6d · 1 Parent(s): 0091541

README.md draft

Files changed (1)
  1. README.md +115 -0
README.md ADDED
@@ -0,0 +1,115 @@
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - finetuner
+ - mteb
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - alibi
+ datasets:
+ - allenai/c4
+ language: en
+ license: apache-2.0
+ model-index:
+ - name: jina-embedding-b-en-v2
+   results: []
+ ---
+ <!-- TODO: add evaluation results here -->
+ <br><br>
+
+ <p align="center">
+ <img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>
+
+
+ <p align="center">
+ <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>, <a href="https://github.com/jina-ai/finetuner"><b>Finetuner</b></a> team.</b>
+ </p>
+
+
+ ## Intended Usage & Model Info
+
+ `jina-embedding-b-en-v2` is an English, monolingual embedding model supporting 8k sequence length.
+ It is based on a BERT architecture that uses the symmetric bidirectional variant of ALiBi to support longer sequence lengths.
+ The backbone Jina BERT model is pretrained on the C4 dataset.
+ The model is further trained on Jina AI's collection of more than 40 datasets of sentence pairs and hard negatives.
+ These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
+
+ The embedding model was trained with a sequence length of 512, but extrapolates to a sequence length of 8k thanks to ALiBi.
+ This makes the model useful for a range of use cases, especially those that involve processing long documents, including long document retrieval, semantic textual similarity, text reranking, recommendation, and RAG and LLM-based generative search.
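
As a rough illustration of why ALiBi can extrapolate beyond the training length, the sketch below builds a symmetric (bidirectional) ALiBi bias: each attention head adds a penalty proportional to the absolute distance between two positions. The function names `alibi_slopes` and `symmetric_alibi_bias` are hypothetical, and the slope schedule (the geometric sequence from the original ALiBi paper, for a power-of-two number of heads) is an assumption; none of this is taken from the model's own code.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h / num_heads) for h = 1..num_heads.

    Assumption: num_heads is a power of two, as in the original ALiBi paper.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def symmetric_alibi_bias(seq_len, num_heads):
    """Return bias[h][i][j] = -slope_h * |i - j| as nested lists.

    The bias would be added to the attention scores before the softmax.
    """
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(num_heads)
    ]


# The bias depends only on the relative distance |i - j|, so the matrix for a
# short sequence is the top-left block of the matrix for a longer one.
bias_short = symmetric_alibi_bias(seq_len=4, num_heads=8)
bias_long = symmetric_alibi_bias(seq_len=16, num_heads=8)
```

Because no learned position embedding is tied to an absolute index, the same rule applies unchanged to positions the model never saw during training.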
+
+ This model has 137 million parameters, which enables fast and memory-efficient inference on long documents while still delivering strong performance.
+ The following embedding models are available, all supporting 8k sequence length:
+
+ - [`jina-embedding-s-en-v2`](https://huggingface.co/jinaai/jina-embedding-s-en-v2): 33 million parameters.
+ - [`jina-embedding-b-en-v2`](https://huggingface.co/jinaai/jina-embedding-b-en-v2): 137 million parameters **(you are here)**.
+ - [`jina-embedding-l-en-v2`](https://huggingface.co/jinaai/jina-embedding-l-en-v2): 435 million parameters.
+
+ ## Data & Parameters
+ <!-- TODO: update the paper ID once it is published on arxiv -->
+ Please check out our [technical report](https://arxiv.org/abs/2307.11224).
+
+ ## Metrics
+
+ We compared the model against `all-minilm-l6-v2` and `all-mpnet-base-v2` from Sentence-Transformers (SBERT) and `text-embedding-ada-002` from OpenAI:
+
+ <!-- TODO: add evaluation table here -->
+
+ ## Usage
+
+ You can use Jina Embedding models directly from the `transformers` package:
+ ```python
+ # Requires the transformers package: pip install transformers
+ from transformers import AutoModel
+ from numpy.linalg import norm
+
+ # Cosine similarity between two 1-D embedding vectors
+ cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+ # trust_remote_code is needed to use the encode method
+ model = AutoModel.from_pretrained('jinaai/jina-embedding-b-en-v2', trust_remote_code=True)
+ embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
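
For retrieval-style use cases, the same cosine similarity can rank several documents against one query. The sketch below is a minimal, dependency-free illustration: the helper `rank_by_similarity` is hypothetical, and the short vectors are toy placeholders standing in for real embeddings returned by `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(idx, cosine_similarity(query_vec, vec)) for idx, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, docs)
for idx, score in ranking:
    print(idx, round(score, 3))
```

With real embeddings, `docs` would hold one vector per document and the top-ranked indices would be the retrieval results.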
+
+ For long sequences, we recommend running inference with Flash Attention, which lets you increase the batch size and throughput.
+ We include an experimental implementation of Flash Attention, shipped with the model.
+ Install the following triton version:
+ `pip install triton==2.0.0.dev20221202`.
+ Then run the same code as above, but set the parameter `with_flash` to `True` when you load the model. You also have to use either `fp16` or `bf16`:
+ ```python
+ from transformers import AutoModel
+ from numpy.linalg import norm
+ import torch
+
+ # Cosine similarity between two 1-D embedding vectors
+ cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+ # trust_remote_code is needed to use the encode method; the model is loaded in fp16 and moved to the GPU
+ model = AutoModel.from_pretrained('jinaai/jina-embedding-b-en-v2', trust_remote_code=True, with_flash=True, torch_dtype=torch.float16).cuda()
+ embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
+ ## Fine-tuning
+
+ To fine-tune the model on your own data, consider [Finetuner](https://github.com/jina-ai/finetuner).
+
+ ## Plans
+ New multilingual models are currently under development, mainly targeting German and Spanish. The upcoming models will be called `jina-embedding-s/b/l-de/es-v2`.
+
+ ## Contact
+
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
+
+ ## Citation
+
+ If you find Jina Embeddings useful in your research, please cite the following paper:
+
+ <!-- TODO: update the paper ID once it is published on arxiv -->
+ ```bibtex
+ @misc{günther2023jina,
+ title={Beyond the 512-Token Barrier: Training General-Purpose Text
+ Embeddings for Large Documents},
+ author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang},
+ year={2023},
+ eprint={2307.11224},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```