polish-reranker-base-ranknet
This is a Polish text ranking model trained with RankNet loss on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents. The training data included the following parts: 1) The Polish MS MARCO training split (800k queries); 2) The ELI5 dataset translated to Polish (over 500k queries); 3) A collection of Polish medical questions and answers (approximately 100k queries). As a teacher model, we employed unicamp-dl/mt5-13b-mmarco-100k, a large multilingual reranker based on the MT5-XXL architecture. As a student model, we choose Polish RoBERTa. Unlike more commonly used pointwise losses, which regard each query-document pair independently, the RankNet method computes loss based on queries and pairs of documents. More specifically, the loss is computed based on the relative order of documents sorted by their relevance to the query. To train the reranker, we used the teacher model to assess the relevance of the documents extracted in the retrieval stage for each query. We then sorted these documents by the relevance score, obtaining a dataset consisting of queries and ordered lists of 20 documents per query.
Usage (Sentence-Transformers)
You can use the model like this with sentence-transformers:
from sentence_transformers import CrossEncoder
import torch.nn
query = "Jak dożyć 100 lat?"
answers = [
"Trzeba zdrowo się odżywiać i uprawiać sport.",
"Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
"Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = CrossEncoder(
"sdadas/polish-reranker-base-ranknet",
default_activation_function=torch.nn.Identity(),
max_length=512,
device="cuda" if torch.cuda.is_available() else "cpu"
)
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
Usage (Huggingface Transformers)
The model can also be used with Huggingface Transformers in the following way:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
query = "Jak dożyć 100 lat?"
answers = [
"Trzeba zdrowo się odżywiać i uprawiać sport.",
"Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
"Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model_name = "sdadas/polish-reranker-base-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
output = model(**tokens)
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
Evaluation Results
The model achieves NDCG@10 of 60.32 in the Rerankers category of the Polish Information Retrieval Benchmark. See PIRB Leaderboard for detailed results.
Citation
@article{dadas2024assessing,
title={Assessing generalization capability of text ranking models in Polish},
author={Sławomir Dadas and Małgorzata Grębowiec},
year={2024},
eprint={2402.14318},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- Downloads last month
- 1,560