---
license: cc-by-sa-4.0
datasets:
- unicamp-dl/mmarco
- bclavie/mmarco-japanese-hard-negatives
language:
- ja
---
## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)
None of these models were trained on the MIRACL training data.
| Model | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
|------------------|---------|-------------|----------|-----------|
| BM25 | 0.369 | 0.931 | - | - |
| splade-japanese | 0.405 | 0.931 | 0.406 | 0.663 |
| splade-japanese-efficient| 0.408 | 0.954 | 0.419 | 0.718 |
| splade-japanese-v2 | 0.580 | 0.967 | 0.629 | 0.844 |
| splade-japanese-v2-doc | 0.478 | 0.930 | 0.514 | 0.759 |
| splade-japanese-v3 | **0.604** | **0.979** | **0.647** | **0.877** |
*The 'splade-japanese-v2-doc' model does not require a query encoder at inference time.
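
To illustrate what skipping the query encoder can look like: a common setup for document-only sparse models (assumed here, not confirmed by this card) treats the query as the plain set of its tokens, each with weight 1, so only documents pass through the model. The sketch below uses hypothetical, hand-written document weights and a hypothetical helper name `score_doc_only`.

```python
def score_doc_only(query_tokens, doc_weights):
    """Sum the document weights of the distinct query tokens (query weight = 1)."""
    return sum(doc_weights.get(tok, 0.0) for tok in set(query_tokens))


# Hypothetical precomputed document representation (token -> weight),
# not real model output.
doc_weights = {"筑波": 1.5, "大学": 1.0, "研究": 0.5}

# Query tokens would come straight from the tokenizer; no model forward pass.
query_tokens = ["筑波", "大学", "場所"]

score = score_doc_only(query_tokens, doc_weights)
print(score)
```

Tokens missing from the document representation simply contribute nothing to the score.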
## Evaluation on [hotchpotch/JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA)
| Model | NDCG@10 | MRR@10 | NDCG@100 | MRR@100 |
| ------------------- | --------- | --------- | --------- | --------- |
| splade-japanese-v3 | 0.505 | 0.772 | 0.700 | 0.775 |
| JaColBERTv2 | 0.585 | 0.836 | 0.753 | 0.838 |
| JaColBERT | 0.549 | 0.811 | 0.730 | 0.814 |
| bge-m3+all | 0.576 | 0.818 | 0.745 | 0.820 |
| bge-m3+dense | 0.539 | 0.785 | 0.721 | 0.788 |
| m-e5-large | 0.554 | 0.799 | 0.731 | 0.801 |
| m-e5-base | 0.471 | 0.727 | 0.673 | 0.731 |
| m-e5-small | 0.492 | 0.729 | 0.689 | 0.733 |
| GLuCoSE | 0.308 | 0.518 | 0.564 | 0.527 |
| sup-simcse-ja-base | 0.324 | 0.541 | 0.572 | 0.550 |
| sup-simcse-ja-large | 0.356 | 0.575 | 0.596 | 0.583 |
| fio-base-v0.1 | 0.372 | 0.616 | 0.608 | 0.622 |
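
For readers unfamiliar with the column names, here is a minimal sketch of how MRR@k and NDCG@k are commonly computed for one query. It assumes a linear-gain DCG and that the input list covers every candidate for the query, in ranked order; the exact evaluation script behind the table may differ. Reported numbers are averages of these per-query values.

```python
import math


def mrr_at_k(ranked_relevance, k):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg_at_k(relevance, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(ranked_relevance, k):
    """DCG of the ranking divided by the DCG of the ideal reordering."""
    ideal = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    return dcg_at_k(ranked_relevance, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels in ranked order (1 = relevant, 0 = not).
ranked = [0, 1, 0, 1]
mrr = mrr_at_k(ranked, 10)
ndcg = ndcg_at_k(ranked, 10)
print(mrr, ndcg)
```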
To inspect the term expansion and term weights for a query or document, run the code below.
First, install the tokenizer dependencies:
```
!pip install fugashi ipadic unidic-lite
```
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
# Map vocabulary ids back to token strings.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}


def encode_query(query):
    # Max lengths: 32 tokens for queries, 180 for passages.
    inputs = tokenizer(query, return_tensors="pt")
    logits = model(**inputs, return_dict=True).logits
    # SPLADE pooling: log(1 + ReLU(logits)), masked, then max over the sequence.
    weights = torch.log(1 + torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
    output, _ = torch.max(weights, dim=1)
    return output


with torch.no_grad():
    reps = encode_query(query="筑波大学では何の研究が行われているか?")

# Collect the non-zero vocabulary entries and their weights.
dict_splade = {}
for i in torch.nonzero(reps[0], as_tuple=False):
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        dict_splade[vocab_dict[int(i[0])]] = float(token_value)

# Print tokens from highest to lowest weight.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
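
The token weights printed above form a sparse vector over the vocabulary. In SPLADE-style retrieval, relevance is typically scored as the dot product between the query and document sparse vectors. The sketch below shows that scoring step on plain dictionaries; `sparse_dot` is a hypothetical helper and the weights are made up for illustration, not real model output.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product over the tokens shared by two sparse {token: weight} dicts."""
    # Iterate over the smaller dict so the work scales with the shorter vector.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights[t]
               for t, w in query_weights.items() if t in doc_weights)


# Hypothetical sparse representations (token -> weight).
query_weights = {"筑波": 2.0, "大学": 1.5, "研究": 1.0}
doc_weights = {"筑波": 1.0, "研究": 0.5, "茨城": 0.25}

score = sparse_dot(query_weights, doc_weights)
print(score)
```

A dictionary such as `dict_splade` from the code above can be passed in directly as either argument.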