jupyterjazz
committed on
readme: usage and performance
README.md
CHANGED
@@ -160,6 +160,47 @@ The data and training details are described in the technical report (coming soon
## Usage

**<details><summary>Apply mean pooling when integrating the model.</summary>**
<p>

### Why Use Mean Pooling?

Mean pooling takes all token embeddings from the model's output and averages them at the sentence or paragraph level.
This approach has been shown to produce high-quality sentence embeddings.

We provide an `encode` function that handles this for you automatically.

However, if you're working with the model directly, outside of the `encode` function,
you'll need to apply mean pooling manually. Here's how you can do it:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask over the hidden dimension so padding tokens contribute zero.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked embeddings and divide by the number of real tokens.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# L2-normalize so that dot products between embeddings equal cosine similarities.
embeddings = F.normalize(embeddings, p=2, dim=1)
```
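
To make the arithmetic concrete, here is a minimal, dependency-free sketch of masked mean pooling followed by L2 normalization on a toy input. The helper names `masked_mean` and `l2_normalize` and the toy vectors are illustrative assumptions, not part of the model's API; the sketch only mirrors what the PyTorch snippet does for a single sequence.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    # when a sequence contains no real tokens.
    count = max(sum(attention_mask), 1e-9)
    return [value / count for value in summed]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length (eps guards against a zero norm)."""
    norm = max(math.sqrt(sum(v * v for v in vector)), eps)
    return [v / norm for v in vector]

# Toy "sentence": two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)   # padding vector is excluded from the average
unit = l2_normalize(pooled)          # unit-length embedding
```

Because the padding token is masked out, `pooled` is the average of the first two vectors only; without the mask, the large padding values would dominate the result.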

</p>
</details>

1. The easiest way to start using jina-clip-v1-en is to use Jina AI's [Embeddings API](https://jina.ai/embeddings/).
2. Alternatively, you can use Jina CLIP directly via the `transformers` package.