sdadas committed on
Commit
1fa7292
1 Parent(s): 3fa159f

Update README.md

Files changed (1)
  1. README.md +72 -0
README.md CHANGED
@@ -1,3 +1,75 @@
  ---
+ pipeline_tag: text-classification
+ tags:
+ - transformers
+ - information-retrieval
+ language: pl
  license: apache-2.0
+
  ---
+
+ <h1 align="center">polish-reranker-large-ranknet</h1>
+
+ This is a Polish text ranking model trained with [RankNet loss](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf) on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents.
+ The training data consisted of three parts: 1) the Polish MS MARCO training split (800k queries); 2) the ELI5 dataset translated to Polish (over 500k queries); 3) a collection of Polish medical questions and answers (approximately 100k queries).
+ As a teacher model, we employed [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k), a large multilingual reranker based on the MT5-XXL architecture. As a student model, we chose [Polish RoBERTa](https://huggingface.co/sdadas/polish-roberta-large-v2).
+ Unlike the more commonly used pointwise losses, which treat each query-document pair independently, the RankNet method computes the loss over a query and pairs of documents. More specifically, the loss is computed based on the relative order of documents sorted by their relevance to the query.
+ To train the reranker, we used the teacher model to assess the relevance of the documents extracted in the retrieval stage for each query. We then sorted these documents by relevance score, obtaining a dataset consisting of queries and ordered lists of 20 documents per query.
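
The two steps described above (ordering documents by teacher score, then penalizing the student for every pair it ranks in the wrong order) can be sketched in plain Python. This is only an illustration under stated assumptions, not the training code behind this model: the function names `order_by_teacher` and `ranknet_loss` are hypothetical, the loss uses hard pairwise targets (the teacher-preferred document should always win), and truncating to the top 20 is just one way to arrive at lists of 20 documents.

```python
import math
from itertools import combinations


def order_by_teacher(documents, teacher_scores, list_size=20):
    """Sort documents by descending teacher score and keep the top `list_size`.

    Assumption: the source only says the lists hold 20 documents per query;
    truncation is one possible way to get there.
    """
    ranked = sorted(zip(documents, teacher_scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:list_size]]


def ranknet_loss(student_scores):
    """Mean pairwise logistic loss over a list ordered from most to least relevant.

    student_scores[i] is the student's score for the i-th document in the
    teacher's ordering, so for every pair i < j the student should give
    student_scores[i] > student_scores[j].
    """
    pairs = list(combinations(range(len(student_scores)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        diff = student_scores[i] - student_scores[j]
        # -log(sigmoid(diff)), written in a numerically stable form.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pairs)


docs = order_by_teacher(["a", "b", "c"], [0.2, 0.9, 0.5])  # ["b", "c", "a"]
loss = ranknet_loss([2.0, 1.0, -1.0])  # student scores listed in teacher order
```

The loss shrinks as the student's scores agree more strongly with the teacher's ordering, and it depends only on score differences within a list, not on the absolute score of any single query-document pair.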
+
+ 💡 The method has proven to be highly effective. The provided model outperforms the teacher model on the Polish Information Retrieval Benchmark, despite having 30 times fewer parameters and being 33 times faster than the teacher! 💡
+
+ ## Usage (Sentence-Transformers)
+
+ You can use the model with [sentence-transformers](https://www.SBERT.net) as follows:
+
+ ```python
+ from sentence_transformers import CrossEncoder
+ import torch
+
+ # Query: "How to live to be 100?"
+ query = "Jak dożyć 100 lat?"
+ answers = [
+     # "You need to eat healthily and do sports."
+     "Trzeba zdrowo się odżywiać i uprawiać sport.",
+     # "You need to drink alcohol, party and drive fast cars."
+     "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
+     # "During the campaign, politicians promised to deal with the Sunday trading ban."
+     "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
+ ]
+
+ # The identity activation returns the raw relevance logits.
+ model = CrossEncoder(
+     "sdadas/polish-reranker-large-ranknet",
+     default_activation_function=torch.nn.Identity(),
+     max_length=512,
+     device="cuda" if torch.cuda.is_available() else "cpu"
+ )
+ # Score every (query, answer) pair; the result holds one score per answer.
+ pairs = [[query, answer] for answer in answers]
+ results = model.predict(pairs)
+ print(results.tolist())
+ ```
+
+ ## Usage (Hugging Face Transformers)
+
+ The model can also be used with Hugging Face Transformers in the following way:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import numpy as np
+ import torch
+
+ # Query: "How to live to be 100?"
+ query = "Jak dożyć 100 lat?"
+ answers = [
+     "Trzeba zdrowo się odżywiać i uprawiać sport.",
+     "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
+     "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
+ ]
+
+ model_name = "sdadas/polish-reranker-large-ranknet"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ # Join query and answer with the RoBERTa-style separator between the two segments.
+ texts = [f"{query}</s></s>{answer}" for answer in answers]
+ tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
+ # Inference only, so gradient tracking is not needed.
+ with torch.no_grad():
+     output = model(**tokens)
+ results = output.logits.numpy()
+ # Drop the trailing singleton dimension to get one score per answer.
+ results = np.squeeze(results)
+ print(results.tolist())
+ ```
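
The snippets above print one raw score per answer. In a reranking pipeline those scores are typically used to sort the candidates, on the assumption that a higher score means a more relevant answer. A minimal sketch of that last step follows; the `rerank` helper and the example scores are hypothetical and are not real model output.

```python
def rerank(candidates, scores):
    """Return (candidate, score) pairs sorted from highest to lowest score."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


# Made-up scores, used only to illustrate the sorting step.
example_answers = ["answer A", "answer B", "answer C"]
example_scores = [0.3, 2.1, -4.0]
ranked = rerank(example_answers, example_scores)
print([answer for answer, _ in ranked])  # ['answer B', 'answer A', 'answer C']
```

With real output, `results.tolist()` from either usage example could be passed as `scores` alongside `answers`.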
+
+ ## Evaluation Results
+
+ The model achieves an **NDCG@10** of **62.65** in the Rerankers category of the Polish Information Retrieval Benchmark (PIRB). See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
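
For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 results in the model's ordering with the gain of an ideal ordering. The sketch below shows one common formulation (linear gain, log2 discount). It is an assumption that this matches the variant used by the benchmark, and the function names are illustrative only.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the first k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the documents, listed in the order the reranker returned them.
print(ndcg_at_k([1, 0, 0]))  # relevant document ranked first
print(ndcg_at_k([0, 0, 1]))  # relevant document ranked third
```

A perfect ordering scores 1.0, and the score drops as relevant documents move further down the list.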