---
license: cc-by-sa-4.0
datasets:
- unicamp-dl/mmarco
- bclavie/mmarco-japanese-hard-negatives
language:
- ja
---



## Evaluation on [MIRACL Japanese](https://huggingface.co/datasets/miracl/miracl)
None of these models were trained on the MIRACL training data.

| Model | nDCG@10 | Recall@1000 | Recall@5 | Recall@30 |
|---------------------------|---------|-------------|----------|-----------|
| BM25 | 0.369 | 0.931 | - | - |
| splade-japanese | 0.405 | 0.931 | 0.406 | 0.663 |
| splade-japanese-efficient | 0.408 | 0.954 | 0.419 | 0.718 |
| splade-japanese-v2 | 0.580 | 0.967 | 0.629 | 0.844 |
| splade-japanese-v2-doc | 0.478 | 0.930 | 0.514 | 0.759 |
| splade-japanese-v3 | 0.604 | 0.979 | 0.647 | 0.877 |


*The `splade-japanese-v2-doc` model does not require a query encoder at inference time.
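For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Recall@k and nDCG@k can be computed for a single query. It assumes binary relevance labels and a log2 rank discount; the exact evaluation tooling behind the numbers above is not stated in this card, and the function names and document ids are hypothetical.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k, assuming binary relevance and a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant documents placed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking for one query; the document ids are made up.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))   # 0.5
print(ndcg_at_k(ranked, relevant, k=10))    # about 0.651
```

Reported scores are typically the mean of these per-query values over all evaluation queries.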


## Evaluation on [hotchpotch/JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA)

| Model | nDCG@10 | MRR@10 | nDCG@100 | MRR@100 |
|---------------------|---------|--------|----------|---------|
| splade-japanese-v3 | 0.505 | 0.772 | 0.7 | 0.775 |
| JaColBERTv2 | 0.585 | 0.836 | 0.753 | 0.838 |
| JaColBERT | 0.549 | 0.811 | 0.730 | 0.814 |
| bge-m3+all | 0.576 | 0.818 | 0.745 | 0.820 |
| bge-m3+dense | 0.539 | 0.785 | 0.721 | 0.788 |
| m-e5-large | 0.554 | 0.799 | 0.731 | 0.801 |
| m-e5-base | 0.471 | 0.727 | 0.673 | 0.731 |
| m-e5-small | 0.492 | 0.729 | 0.689 | 0.733 |
| GLuCoSE | 0.308 | 0.518 | 0.564 | 0.527 |
| sup-simcse-ja-base | 0.324 | 0.541 | 0.572 | 0.550 |
| sup-simcse-ja-large | 0.356 | 0.575 | 0.596 | 0.583 |
| fio-base-v0.1 | 0.372 | 0.616 | 0.608 | 0.622 |
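The MRR@k columns report mean reciprocal rank. As a rough illustration, and not the evaluation script used for the table, MRR@k can be sketched as follows. The function name and the relevance labels are hypothetical.

```python
def mrr_at_k(rankings, k):
    """Mean reciprocal rank of the first relevant hit within the top k.

    `rankings` holds one list of 0/1 relevance labels per query,
    ordered by the rank the system assigned.
    """
    total = 0.0
    for labels in rankings:
        for rank, label in enumerate(labels[:k], start=1):
            if label:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0

# Made-up labels for three queries: first hit at rank 2, rank 1, and no hit.
rankings = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(mrr_at_k(rankings, k=10))  # (1/2 + 1 + 0) / 3 = 0.5
```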


Running the code below shows the term expansion and term weights that the model produces for a query or document.

Install the tokenizer dependencies first (the leading `!` is for notebook cells; drop it in a regular shell):

```
!pip install fugashi ipadic unidic-lite
```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
# Map vocabulary ids back to token strings.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):  # max lengths: 32 tokens for queries, 180 for passages
    inputs = tokenizer(query, return_tensors="pt")
    logits = model(**inputs, return_dict=True).logits
    # SPLADE pooling: log(1 + ReLU(logits)), masked by attention,
    # then max over the sequence dimension.
    output, _ = torch.max(
        torch.log(1 + torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1),
        dim=1,
    )
    return output

with torch.no_grad():
    # "What research is conducted at the University of Tsukuba?"
    model_output = encode_query(query="筑波大学では何の研究が行われているか？")

reps = model_output
# Indices of the vocabulary entries with a non-zero weight.
idx = torch.nonzero(reps[0], as_tuple=False)

dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expanded tokens, highest weight first.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
```
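Once queries and documents are encoded into `{token: weight}` dictionaries like `dict_splade`, SPLADE-style models are usually scored with a sparse dot product. This card does not describe the scoring step, so the sketch below is an assumption about typical usage. The `sparse_dot` helper and all weights are made up for illustration and are not model outputs.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict to limit the number of lookups.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights.get(tok, 0.0) for tok, w in query_weights.items())

# Made-up weights standing in for encoded query and document representations.
query = {"筑波": 2.0, "大学": 1.5, "研究": 1.0}
docs = {
    "doc_a": {"筑波": 1.0, "研究": 2.0, "科学": 0.5},
    "doc_b": {"大学": 1.0, "入試": 1.2},
}

# Rank documents by their score against the query, best first.
ranked = sorted(docs, key=lambda d: sparse_dot(query, docs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

Because only overlapping tokens contribute, the same representations can also be stored in an inverted index instead of being compared pairwise.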