- FacebookAI/xlm-roberta-large
---

# PreCOMET-var [](https://arxiv.org/abs/2501.18251)

This is a source-only COMET model used for efficient evaluation subset selection.

Specifically, this model predicts the expected variance in human scores of a source segment's translations. It is trained on direct assessment scores from WMT campaigns up to WMT2022.

The higher the score, the more useful the segment is for evaluation, because it is more likely to distinguish between systems.

It is not compatible with the original Unbabel COMET; to run it, install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET):

```bash
pip install git+https://github.com/zouharvi/PreCOMET.git
```

You can then use it in Python:

```python
import precomet
model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-var"))
model.predict([
    {"src": "This is an easy source sentence."},
    {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"}
])["scores"]
> [70.99381256103516, 70.99385833740234]
```
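Outside of any framework, the predicted scores can be used directly: rank source segments by score and keep the top-k as the evaluation subset. A minimal sketch with made-up scores (in practice they would come from `model.predict(...)["scores"]`):

```python
# Rank source segments by predicted score (higher = more informative)
# and keep the top-k as the evaluation subset.
# The scores below are made up for illustration.
segments = ["src A", "src B", "src C", "src D"]
scores = [12.3, 70.9, 45.1, 70.99]

top_k = 2
ranked = sorted(zip(segments, scores), key=lambda x: x[1], reverse=True)
subset = [seg for seg, _ in ranked[:top_k]]
print(subset)  # ['src D', 'src B']
```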

The primary use of this model is from the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package:

```python
import subset2evaluate

data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
data_random = subset2evaluate.select_subset.basic(data_full, method="random")
subset2evaluate.evaluate.eval_subset_clusters(data_random[:100])
> 1
subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full)
> 0.71
```

Random selection gives us only one cluster and a system-level Spearman correlation of 0.71 when we have a budget of only 100 segments. However, by using this model:

```python
data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_var")
subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100])
> 2
subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full)
> 0.92
```
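For intuition, the system-level Spearman correlation reported here compares how MT systems rank on the subset versus on the full data. A rough standalone illustration of the metric (not subset2evaluate's implementation), assuming no tied scores:

```python
# System-level Spearman correlation: rank systems by their average score
# on the subset and on the full test set, then correlate the two rankings.
# (Illustration only; assumes no ties.)

def ranks(values):
    # rank 1 = highest value
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# hypothetical average scores of four MT systems
subset_scores = [0.81, 0.78, 0.90, 0.60]  # on a 100-segment subset
full_scores = [0.80, 0.79, 0.88, 0.62]    # on the full test set
print(spearman(subset_scores, full_scores))  # 1.0 -> subset preserves the system ranking
```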

we get a higher correlation and more clusters.
You can expect an even bigger effect at a larger scale, as described in the paper.

This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251).
Cite as:
```
@misc{zouhar2025selectdatapointsefficienthuman,
    title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
    author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
    year={2025},
    eprint={2501.18251},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2501.18251},
}
```