---
license: cc-by-sa-4.0
datasets:
- unicamp-dl/mmarco
- bclavie/mmarco-japanese-hard-negatives
language:
- ja
---



## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)
These models were not trained on the MIRACL training data.

| Model                     | nDCG@10   | Recall@1000 | Recall@5  | Recall@30 |
|---------------------------|-----------|-------------|-----------|-----------|
| BM25                      | 0.369     | 0.931       | -         | -         |
| splade-japanese           | 0.405     | 0.931       | 0.406     | 0.663     |
| splade-japanese-efficient | 0.408     | 0.954       | 0.419     | 0.718     |
| splade-japanese-v2        | 0.580     | 0.967       | 0.629     | 0.844     |
| splade-japanese-v2-doc    | 0.478     | 0.930       | 0.514     | 0.759     |
| splade-japanese-v3        | **0.604** | **0.979**   | **0.647** | **0.877** |


\*The `splade-japanese-v2-doc` model does not require a query encoder during inference.
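
As a rough, hypothetical sketch of what "no query encoder" means (an assumption for illustration, not something this card specifies): a document-only SPLADE model expands and weights only the document side, while the query is reduced to its raw token ids with unit weights, so scoring is simply the sum of the document's weights at the query's tokens. The repo id and example strings below are assumptions.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Hypothetical: the repo id below is assumed; point it at the actual doc-only checkpoint.
DOC_MODEL_ID = "aken12/splade-japanese-v2-doc"
model = AutoModelForMaskedLM.from_pretrained(DOC_MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(DOC_MODEL_ID)

def encode_doc(doc):
    # SPLADE pooling: log(1 + ReLU(logits)), masked by attention, max-pooled over tokens
    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=180)
    logits = model(**inputs, return_dict=True).logits
    weights = torch.log(1 + torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
    return torch.max(weights, dim=1).values.squeeze(0)  # (vocab_size,)

def score(query, doc_vec):
    # No query encoder: each query token contributes with weight 1
    q_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    return sum(doc_vec[i].item() for i in q_ids)

with torch.no_grad():
    doc_vec = encode_doc("筑波大学では情報学や体育学など幅広い分野の研究が行われている。")
print(score("筑波大学では何の研究が行われているか?", doc_vec))
```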


## Evaluation on [hotchpotch/JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA)

| Model               | NDCG@10 | MRR@10 | NDCG@100 | MRR@100 |
|---------------------|---------|--------|----------|---------|
| splade-japanese-v3  | 0.505   | 0.772  | 0.7      | 0.775   |
| JaColBERTv2         | 0.585   | 0.836  | 0.753    | 0.838   |
| JaColBERT           | 0.549   | 0.811  | 0.730    | 0.814   |
| bge-m3+all          | 0.576   | 0.818  | 0.745    | 0.820   |
| bge-m3+dense        | 0.539   | 0.785  | 0.721    | 0.788   |
| m-e5-large          | 0.554   | 0.799  | 0.731    | 0.801   |
| m-e5-base           | 0.471   | 0.727  | 0.673    | 0.731   |
| m-e5-small          | 0.492   | 0.729  | 0.689    | 0.733   |
| GLuCoSE             | 0.308   | 0.518  | 0.564    | 0.527   |
| sup-simcse-ja-base  | 0.324   | 0.541  | 0.572    | 0.550   |
| sup-simcse-ja-large | 0.356   | 0.575  | 0.596    | 0.583   |
| fio-base-v0.1       | 0.372   | 0.616  | 0.608    | 0.622   |


Running the code below lets you inspect the term expansion and term weights for a query (or document).

You first need to install the Japanese tokenizer dependencies:

```
!pip install fugashi ipadic unidic-lite
```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):  # max length: 32 for queries, 180 for passages
    query = tokenizer(query, return_tensors="pt")
    output = model(**query, return_dict=True).logits
    # SPLADE pooling: log(1 + ReLU(logits)), masked by attention, max-pooled over tokens
    output, _ = torch.max(torch.log(1 + torch.relu(output)) * query['attention_mask'].unsqueeze(-1), dim=1)
    return output

with torch.no_grad():
    model_output = encode_query(query="筑波大学では何の研究が行われているか?")

reps = model_output
idx = torch.nonzero(reps[0], as_tuple=False)

# Map each non-zero dimension back to its vocabulary token and weight
dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expanded terms, sorted by weight (highest first)
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
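
For retrieval, SPLADE-style models score a query-document pair as the dot product of their sparse, vocabulary-sized vectors. The sketch below is a minimal illustration built on the snippet above rather than something shown on this card: it reuses the same encoder for both sides, applies the 32/180 max lengths mentioned in the comment, and the example passage string is invented.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")

def encode(text, max_length):
    # Same pooling as encode_query above, with truncation to the given max length
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    logits = model(**inputs, return_dict=True).logits
    weights = torch.log(1 + torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
    return torch.max(weights, dim=1).values

with torch.no_grad():
    q = encode("筑波大学では何の研究が行われているか?", max_length=32)   # query
    d = encode("筑波大学では情報学や体育学など幅広い分野の研究が行われている。", max_length=180)  # passage

score = (q * d).sum(dim=1)  # sparse dot product over the vocabulary
print(score.item())
```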