---
license: cc-by-sa-4.0
datasets:
- unicamp-dl/mmarco
- bclavie/mmarco-japanese-hard-negatives
language:
- ja
---

## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)
These models were not trained on the MIRACL training data.

| Model                      | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
|----------------------------|---------|-------------|----------|-----------|
| BM25                       | 0.369   | 0.931       | -        | -         |
| splade-japanese            | 0.405   | 0.931       | 0.406    | 0.663     |
| splade-japanese-efficient  | 0.408   | 0.954       | 0.419    | 0.718     |
| splade-japanese-v2         | 0.580   | 0.967       | 0.629    | 0.844     |
| splade-japanese-v2-doc     | 0.478   | 0.930       | 0.514    | 0.759     |
| splade-japanese-v3         | 0.604   | 0.979       | 0.647    | 0.877     |

\*The `splade-japanese-v2-doc` model does not require a query encoder during inference.
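
In a document-only setup of this kind, documents are typically expanded and weighted by the model offline, while a query is reduced to a plain binary bag of its tokenizer tokens, so no model forward pass is needed at query time. The sketch below illustrates that idea only; the checkpoint id `aken12/splade-japanese-v2-doc` and the exact query representation are assumptions and may not match the released model exactly.

```python
# Minimal sketch of document-only (SPLADE-doc style) scoring: the document side uses
# the learned sparse expansion, the query side is just a binary bag of tokenizer tokens.
# The checkpoint id and query representation below are assumptions for illustration.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v2-doc")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v2-doc")
vocab_size = model.config.vocab_size

def encode_document(text):
    # Learned sparse document vector: log(1 + ReLU(logits)), max-pooled over tokens.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
    return torch.max(weights, dim=1).values.squeeze(0)  # shape: (vocab_size,)

def encode_query_binary(text):
    # Query side: a 0/1 indicator over vocabulary ids, no model forward pass needed.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    vec = torch.zeros(vocab_size)
    vec[ids] = 1.0
    return vec

doc_vec = encode_document("筑波大学では情報学や生命科学など様々な分野の研究が行われています。")
query_vec = encode_query_binary("筑波大学では何の研究が行われているか?")
print("score:", torch.dot(query_vec, doc_vec).item())  # inner product = relevance score
```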

If you'd like to try it out, run the code below to see the term expansion and the weight assigned to each token for a query (or a document).

You need to install the following tokenizer dependencies first:

```
!pip install fugashi ipadic unidic-lite
```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the SPLADE model (an MLM head over the vocabulary) and its tokenizer.
model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v2")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v2")
# Map vocabulary ids back to token strings so the sparse weights can be printed.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):
    # Sparse query vector over the vocabulary: log(1 + ReLU(logits)),
    # masked by the attention mask and max-pooled over the token dimension.
    query = tokenizer(query, return_tensors="pt")
    output = model(**query, return_dict=True).logits
    output, _ = torch.max(torch.log(1 + torch.relu(output)) * query['attention_mask'].unsqueeze(-1), dim=1)
    return output

with torch.no_grad():
    model_output = encode_query(query="筑波大学では何の研究が行われているか?")

reps = model_output
idx = torch.nonzero(reps[0], as_tuple=False)

# Collect the non-zero (expanded) tokens and their weights.
dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expanded tokens sorted by weight, highest first.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
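
Documents can be encoded with the same function, and the relevance score of a query-document pair is then the inner product of the two sparse vectors. A minimal sketch, reusing `encode_query` and `model_output` from the block above (the document text is only an illustrative example):

```python
# Sketch: score a document against the query as the dot product of the sparse vectors.
with torch.no_grad():
    doc_output = encode_query(query="筑波大学では情報学や生命科学など様々な分野の研究が行われています。")

score = torch.matmul(model_output, doc_output.T)  # shape (1, 1)
print("relevance score:", score.item())
```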