innocent-charles committed
Commit cb9297d · verified · 1 Parent(s): b40010f

Update README.md

Files changed (1): README.md (+72 -3)

README.md CHANGED
@@ -120,8 +120,13 @@ license: apache-2.0
 ---
 
 # AviLaBSE
- This is a port of the [LaBSE](https://tfhub.dev/google/LaBSE/1) model to PyTorch. It can be used to map 109 languages to a shared vector space.
+ This is a port of the [LaBSE](https://tfhub.dev/google/LaBSE/2) model to PyTorch. Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model that produces sentence embeddings for 109 languages, mapping them into a shared vector space. Its pre-training combines masked language modeling with translation language modeling, which makes it useful for multilingual sentence embeddings and for bi-text retrieval.
+
+ - Model: [HuggingFace's model hub](https://huggingface.co/sartifyllc/AviLaBSE).
+ - Paper: [arXiv](https://arxiv.org/abs/2007.01852).
+ - Original model: [TensorFlow Hub](https://tfhub.dev/google/LaBSE/2).
+ - Blog post: [Google AI Blog](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html).
+ - Conversion from TensorFlow to PyTorch: [GitHub](https://github.com/sartify).
 
 ## Usage (Sentence-Transformers)
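The hunk below shows only the tail of the sentence-transformers example (its header picks up at `embeddings = model.encode(sentences)`). For context, a minimal sketch of the usual loading pattern, assuming the repo id from the "Model" link above:

```python
from sentence_transformers import SentenceTransformer

# Repo id assumed from the "Model" link in this README.
model = SentenceTransformer("sartifyllc/AviLaBSE")

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence
print(embeddings)
```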
@@ -142,6 +147,71 @@ embeddings = model.encode(sentences)
 print(embeddings)
 ```
 
+ ```python
+ import torch
+ from transformers import BertModel, BertTokenizerFast
+
+
+ tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
+ model = BertModel.from_pretrained("setu4993/LaBSE")
+ model = model.eval()
+
+ english_sentences = [
+     "dog",
+     "Puppies are nice.",
+     "I enjoy taking long walks along the beach with my dog.",
+ ]
+ english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     english_outputs = model(**english_inputs)
+ ```
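Note that the added snippet loads the upstream `setu4993/LaBSE` checkpoint rather than this repository. Assuming `sartifyllc/AviLaBSE` ships BERT-compatible weights and tokenizer files (implied by the "Model" link above, but not confirmed by this diff), the same pattern would be:

```python
from transformers import BertModel, BertTokenizerFast

# Assumption: the ported checkpoint is BertModel-compatible.
tokenizer = BertTokenizerFast.from_pretrained("sartifyllc/AviLaBSE")
model = BertModel.from_pretrained("sartifyllc/AviLaBSE").eval()
```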
+
+ To get the sentence embeddings, use the pooler output:
+
+ ```python
+ english_embeddings = english_outputs.pooler_output
+ ```
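As a quick sanity check (a sketch assuming the standard LaBSE configuration, whose hidden size is 768), the pooled output should contain one vector per input sentence:

```python
# One 768-dimensional vector per input sentence.
print(english_embeddings.shape)  # expected: torch.Size([3, 768])
```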
+
+ Output for other languages:
+
+ ```python
+ italian_sentences = [
+     "cane",
+     "I cuccioli sono carini.",
+     "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
+ ]
+ japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
+ italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
+ japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     italian_outputs = model(**italian_inputs)
+     japanese_outputs = model(**japanese_inputs)
+
+ italian_embeddings = italian_outputs.pooler_output
+ japanese_embeddings = japanese_outputs.pooler_output
+ ```
+
+ For similarity between sentences, normalizing the embeddings to unit L2 norm is recommended before computing the similarity (the dot product of unit vectors is their cosine similarity):
+
+ ```python
+ import torch.nn.functional as F
+
+
+ def similarity(embeddings_1, embeddings_2):
+     normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
+     normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
+     return torch.matmul(
+         normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
+     )
+
+
+ print(similarity(english_embeddings, italian_embeddings))
+ print(similarity(english_embeddings, japanese_embeddings))
+ print(similarity(italian_embeddings, japanese_embeddings))
+ ```
+
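The intro mentions bi-text retrieval; the `similarity` matrix above supports a minimal retrieval sketch (not part of this commit): each row scores one English sentence against all Italian candidates, and the row-wise argmax picks the closest one:

```python
# For each English sentence, find the nearest Italian sentence
# in the shared embedding space.
best_match = similarity(english_embeddings, italian_embeddings).argmax(dim=1)
for i, j in enumerate(best_match.tolist()):
    print(english_sentences[i], "->", italian_sentences[j])
```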

 ## Evaluation Results

@@ -164,5 +234,4 @@ SentenceTransformer(
 
  ## Citing & Authors
 
- Have a look at [LaBSE](https://tfhub.dev/google/LaBSE/1) for the respective publication that describes LaBSE.
-
+ Have a look at [LaBSE](https://tfhub.dev/google/LaBSE/2) for the publication that describes LaBSE.