- FacebookAI/xlm-roberta-large
---

# PreCOMET-var [](https://arxiv.org/abs/2501.18251)

This is a source-only COMET model used for efficient evaluation subset selection.

Specifically, this model predicts the expected variance in human scores of a source segment's translations. It is trained on direct assessment scores from WMT campaigns up to WMT2022.

The higher the score, the more useful the segment is for evaluation, because it is more likely to distinguish between systems.

It is not compatible with the original Unbabel COMET; to run it, install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET):

```bash
pip install git+https://github.com/zouharvi/PreCOMET.git
```

You can then use it in Python:

```python
import precomet
model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-var"))
model.predict([
    {"src": "This is an easy source sentence."},
    {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"}
])["scores"]
> [70.99381256103516, 70.99385833740234]
```
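Outside of any framework, the predicted scores can be used directly: rank source segments by score and keep the top-k as the evaluation subset. A minimal sketch with made-up scores (in practice they would come from `model.predict(...)["scores"]`):

```python
# Rank source segments by predicted score (higher = more informative)
# and keep the top-k as the evaluation subset.
# The scores below are made up for illustration.
segments = ["src A", "src B", "src C", "src D"]
scores = [12.3, 70.9, 45.1, 70.99]

top_k = 2
ranked = sorted(zip(segments, scores), key=lambda x: x[1], reverse=True)
subset = [seg for seg, _ in ranked[:top_k]]
print(subset)  # ['src D', 'src B']
```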

The primary use of this model is from the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package:

```python
import subset2evaluate

data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
data_random = subset2evaluate.select_subset.basic(data_full, method="random")
subset2evaluate.evaluate.eval_subset_clusters(data_random[:100])
> 1
subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full)
> 0.71
```

Random selection gives us only one cluster and a system-level Spearman correlation of 0.71 when we have a budget of only 100 segments. However, by using this model:

```python
data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_var")
subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100])
> 2
subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full)
> 0.92
```
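For intuition, the system-level Spearman correlation reported here compares how MT systems rank on the subset versus on the full data. A rough standalone illustration of the metric (not subset2evaluate's implementation), assuming no tied scores:

```python
# System-level Spearman correlation: rank systems by their average score
# on the subset and on the full test set, then correlate the two rankings.
# (Illustration only; assumes no ties.)

def ranks(values):
    # rank 1 = highest value
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# hypothetical average scores of four MT systems
subset_scores = [0.81, 0.78, 0.90, 0.60]  # on a 100-segment subset
full_scores = [0.80, 0.79, 0.88, 0.62]    # on the full test set
print(spearman(subset_scores, full_scores))  # 1.0 -> subset preserves the system ranking
```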

we get a higher correlation and more clusters.
You can expect an even bigger effect at a larger scale, as described in the paper.

This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251).
Cite as:
```
@misc{zouhar2025selectdatapointsefficienthuman,
    title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
    author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
    year={2025},
    eprint={2501.18251},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2501.18251},
}
```