bwang0911 committed
Commit: 2ad5953
1 parent: 182195d

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -1083,7 +1083,7 @@ It is based on a BERT architecture (JinaBERT) that supports the symmetric bidire
 We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed Chinese-English input without bias.
 Additionally, we provide the following embedding models:
 
-`jina-embeddings-v2-base-zh` is a bilingual Chinese-English text embedding model that supports encoding texts of up to 8192 characters.
+`jina-embeddings-v2-base-zh` is a bilingual Chinese-English text **embedding** model that supports encoding texts of up to **8192 characters**.
 The model is built on the BERT architecture (JinaBERT), a modification of BERT that is the first to apply [ALiBi](https://arxiv.org/abs/2108.12409) to an encoder architecture in order to support longer sequences.
 Unlike earlier monolingual/multilingual embedding models, we designed this bilingual model to better support monolingual (Chinese-to-Chinese) as well as cross-lingual (Chinese-to-English) document retrieval.
 In addition, we also provide the following embedding models:
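The paragraph above mentions both monolingual (Chinese-to-Chinese) and cross-lingual (Chinese-to-English) retrieval, while the snippets later in this diff only compare two sentences. A minimal sketch of the cross-lingual case, assuming the `encode` method exposed by the model's remote code (as used further down in this README) and hypothetical example documents, could look like this:

```python
# Sketch only: Chinese query against English documents (cross-lingual retrieval).
# Assumes the `encode` method provided via trust_remote_code, as in the snippets below;
# the documents here are made-up examples.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

query = '今天天气怎么样?'  # "How is the weather today?"
docs = [
    'It is sunny with a light breeze today.',
    'The quarterly earnings report was released this morning.',
]

query_emb = model.encode([query])[0]
doc_embs = model.encode(docs)

# Rank the English documents by cosine similarity to the Chinese query.
scores = doc_embs @ query_emb / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb))
print(docs[int(np.argmax(scores))])
```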
@@ -1121,10 +1121,10 @@ def mean_pooling(model_output, attention_mask):
     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 
-sentences = ['How is the weather today?', 'What is the current weather like today?']
+sentences = ['How is the weather today?', '今天天气怎么样?']
 
-tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-small-en')
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
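For readers viewing this hunk in isolation: only the lines shown above are certain, but the surrounding README snippet follows the standard transformers mean-pooling recipe. A reconstruction under that assumption (the imports, forward pass, and normalization step are not visible in the diff and are assumed) would read roughly:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, masking out padding tokens.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', '今天天气怎么样?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Pool and L2-normalize to obtain sentence embeddings.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)
```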
 
@@ -1145,8 +1145,8 @@ from transformers import AutoModel
 from numpy.linalg import norm
 
 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
-embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True) # trust_remote_code is needed to use the encode method
+embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
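Pieced together from the context lines and additions visible in this hunk, the `encode`-based example after this change would read roughly as follows; only the lines actually shown in the diff are certain, the assembly itself is a sketch:

```python
from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two embedding vectors.
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

# trust_remote_code is needed to use the encode method
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))
```

The printed value is the cosine similarity between the English sentence and its Chinese paraphrase.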
 
 