---
license: cc-by-sa-4.0
datasets:
- unicamp-dl/mmarco
- bclavie/mmarco-japanese-hard-negatives
language:
- ja
---



## Evaluation on JQaRa

| Model               | NDCG@10 | MRR@10 | NDCG@100 | MRR@100 |
| ------------------- | ------- | ------ | -------- | ------- |
| splade-japanese-v3  | 0.505   | 0.772  | 0.700    | 0.775   |
| JaColBERTv2         | 0.585   | 0.836  | 0.753    | 0.838   |
| JaColBERT           | 0.549   | 0.811  | 0.730    | 0.814   |
| bge-m3+all          | 0.576   | 0.818  | 0.745    | 0.820   |
| bge-m3+dense        | 0.539   | 0.785  | 0.721    | 0.788   |
| m-e5-large          | 0.554   | 0.799  | 0.731    | 0.801   |
| m-e5-base           | 0.471   | 0.727  | 0.673    | 0.731   |
| m-e5-small          | 0.492   | 0.729  | 0.689    | 0.733   |
| GLuCoSE             | 0.308   | 0.518  | 0.564    | 0.527   |
| sup-simcse-ja-base  | 0.324   | 0.541  | 0.572    | 0.550   |
| sup-simcse-ja-large | 0.356   | 0.575  | 0.596    | 0.583   |
| fio-base-v0.1       | 0.372   | 0.616  | 0.608    | 0.622   |

## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)
These models were not trained on the MIRACL training data.

| Model            | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
|------------------|---------|-------------|----------|-----------|
| BM25             | 0.369   | 0.931       | -        | -         |
| splade-japanese  | 0.405   | 0.931       | 0.406    | 0.663     |
| splade-japanese-efficient| 0.408  | 0.954      | 0.419   | 0.718    |
| splade-japanese-v2 | 0.580   | 0.967       | 0.629    | 0.844     |
| splade-japanese-v2-doc | 0.478 | 0.930 | 0.514 | 0.759 |
| splade-japanese-v3 | **0.604** | **0.979** | **0.647** | **0.877** |


\*The 'splade-japanese-v2-doc' model does not require a query encoder during inference.
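
As a minimal sketch of what that means in practice: documents are expanded and weighted offline, and at query time the query is only tokenized, so the score is simply the sum of the document's weights at the query's token ids. The snippet below is only an illustration under assumptions — it uses the v3 checkpoint from this repo as a stand-in encoder, made-up example sentences, and requires the tokenizer dependencies from the install note further down.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Stand-in encoder for illustration; swap in the document-only checkpoint you actually use.
model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")

def encode_document(doc, max_length=180):
    # Sparse document vector: max-pooled log(1 + ReLU(logits)) over token positions.
    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=max_length)
    logits = model(**inputs, return_dict=True).logits
    weights, _ = torch.max(
        torch.log(1 + torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1), dim=1
    )
    return weights.squeeze(0)  # shape: (vocab_size,)

def score_query_free(query, doc_weights):
    # No query encoder: the query is only tokenized (binary term weights),
    # and the score is the sum of the document's weights at those vocabulary ids.
    query_ids = set(tokenizer(query, add_special_tokens=False)["input_ids"])
    return float(doc_weights[list(query_ids)].sum())

with torch.no_grad():
    doc_weights = encode_document("筑波大学は茨城県つくば市にある国立大学である。")
print(score_query_free("筑波大学 どこにある", doc_weights))
```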



If you'd like to try it out, you can check the term expansion and weighting for queries or documents by running the code below.

You first need to install the Japanese tokenizer dependencies:

```
!pip install fugashi ipadic unidic-lite
```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3") 
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):  # max input length: 32 for queries, 180 for passages
    query = tokenizer(query, return_tensors="pt")
    output = model(**query, return_dict=True).logits
    # SPLADE pooling: max over token positions of log(1 + ReLU(logits)),
    # masked by the attention mask, giving one weight per vocabulary entry.
    output, _ = torch.max(torch.log(1 + torch.relu(output)) * query['attention_mask'].unsqueeze(-1), dim=1)
    return output

with torch.no_grad():
    model_output = encode_query(query="筑波大学では何の研究が行われているか?")

reps = model_output
idx = torch.nonzero(reps[0], as_tuple=False)

# Collect the non-zero vocabulary entries (the expanded terms) and their weights.
dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expanded terms, sorted by weight in descending order.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
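
As a rough follow-up sketch, the same sparse vectors can be used for relevance scoring: encode a document with the same pooling (truncated to 180 tokens, per the comment above) and take the dot product with the query vector. The `encode_document` helper and the example texts below are illustrative assumptions; `model`, `tokenizer`, and `encode_query` are reused from the snippet above.

```python
def encode_document(doc, max_length=180):
    # Same pooling as encode_query, but truncate to the 180-token passage limit.
    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=max_length)
    logits = model(**inputs, return_dict=True).logits
    weights, _ = torch.max(
        torch.log(1 + torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1), dim=1
    )
    return weights

with torch.no_grad():
    query_rep = encode_query("筑波大学では何の研究が行われているか?")
    relevant = encode_document("筑波大学では情報学や体育学など幅広い分野の研究が行われている。")
    unrelated = encode_document("東京タワーの高さは333メートルである。")

# Relevance score = dot product of the sparse query and document vectors.
print((query_rep * relevant).sum().item())   # expected to be the larger score
print((query_rep * unrelated).sum().item())
```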