---
license: apache-2.0
---
# MetricX-24 (XXL, bfloat16)

*This is not an officially supported Google product.*

> ℹ️ For the full-precision (float32) variant of this model, see [MetricX-24 (XXL)](https://huggingface.co/google/metricx-24-hybrid-xxl-v2p6).

**GitHub repository**: https://github.com/google-research/metricx

The repository contains the code for running inference on MetricX-24 models,
a family of models for automatic evaluation of translations that were proposed
in the WMT'24 Metrics Shared Task submission
[MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task](https://aclanthology.org/2024.wmt-1.35/).
The models were trained in [T5X](https://github.com/google-research/t5x) and
then converted for use in PyTorch.
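
To score translations with this checkpoint, use the prediction script from the
GitHub repository above. Below is a minimal sketch, assuming the repository and
its dependencies are installed; the flags and input fields follow the
repository's documentation, and the Hugging Face model id is assumed from this
card, so double-check both against the current repository README:

```python
import json
import subprocess

# One record per segment to score: a source, a candidate translation
# ("hypothesis"), and a reference translation (field names as documented
# in the GitHub repository).
examples = [{
    "source": "Das Haus ist klein.",
    "hypothesis": "The house is small.",
    "reference": "The house is small.",
}]
with open("input.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Run the repository's prediction script; it writes the input records back to
# output.jsonl with the predicted score added. MetricX scores are error scores
# in [0, 25], so lower is better.
subprocess.run(
    [
        "python", "-m", "metricx24.predict",
        "--tokenizer", "google/mt5-xl",
        "--model_name_or_path", "google/metricx-24-hybrid-xxl-v2p6-bfloat16",
        "--max_input_length", "1536",
        "--batch_size", "1",
        "--input_file", "input.jsonl",
        "--output_file", "output.jsonl",
    ],
    check=True,
)
```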


## Available Models

There are three MetricX-24 models available on Hugging Face that vary in the
number of parameters. Unlike the MetricX-23 models, the MetricX-24 models are
all hybrid models that can do both reference-based and reference-free (also
known as quality estimation, or QE) inference:

* [MetricX-24-Hybrid-XXL](https://huggingface.co/google/metricx-24-hybrid-xxl-v2p6)
* [MetricX-24-Hybrid-XL](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6)
* [MetricX-24-Hybrid-Large](https://huggingface.co/google/metricx-24-hybrid-large-v2p6)

We recommend the XXL version for the best agreement with human judgments of
translation quality, the Large version for the best speed, and the XL version
as an intermediate trade-off between the two.
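
Because the models are hybrid, the same checkpoint handles both modes, and the
mode is determined by the input. A sketch of the two input styles, assuming the
field names and the QE option documented in the GitHub repository:

```python
# Reference-based scoring: the record carries all three fields.
reference_based = {
    "source": "Das Haus ist klein.",
    "hypothesis": "The house is small.",
    "reference": "The house is small.",
}

# Reference-free (QE) scoring: no reference field; the prediction script is
# then invoked with its QE option (`--qe` per the repository's documentation).
quality_estimation = {
    "source": "Das Haus ist klein.",
    "hypothesis": "The house is small.",
}
```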


## Changes to the WMT'24 Submission

The MetricX-24 models available here are most similar to the primary submission
to the WMT'24 Metrics Shared Task. They are initialized with
[mT5](https://aclanthology.org/2021.naacl-main.41/) and then fine-tuned on a
combination of direct assessment and MQM data from WMT'15-'22. However, we made
two small changes that make these models different from the WMT'24 submissions.

First, the metric scores are automatically clipped at 0 and 25, ensuring they
always fall within the [0, 25] range; as with any regression model, the raw
predictions could otherwise occasionally land outside that range.
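
As an illustration only (a sketch, not the repository's actual code), the
clipping amounts to:

```python
def clip_score(raw_score: float) -> float:
    """Clip a raw regression output into the [0, 25] scoring range."""
    return min(max(raw_score, 0.0), 25.0)
```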

Second, we included one additional type of synthetic training example that was
not ready in time for the official submission: perfect translations of
multi-sentence segments, generated from the WMT'20-'22 MQM data. The purpose of
this category of synthetic data is to reduce the model's bias against longer
translations when the source segment and/or reference are also long.


## Model Performance

For comparison with the submissions to the
[WMT'24 Metrics Shared Task](https://www2.statmt.org/wmt24/pdf/2024.wmt-1.2.pdf),
we provide an overview of the system- and segment-level correlations between
the MetricX-24 scores and the MQM ratings of translation quality (SPA:
system-level soft pairwise accuracy; Acc: segment-level pairwise accuracy), as
calculated on the shared task's test sets:

| Model | Sys-Level SPA (en-de) | Seg-Level Acc (en-de) | Sys-Level SPA (en-es) | Seg-Level Acc (en-es) | Sys-Level SPA (ja-zh) | Seg-Level Acc (ja-zh) |
| -------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| MetricX-24-Hybrid-XXL | 0.865 | 0.543 | 0.785 | 0.685 | 0.878 | 0.541 |
| MetricX-24-Hybrid-XL | 0.884 | 0.522 | 0.806 | 0.683 | 0.859 | 0.528 |
| MetricX-24-Hybrid-Large | 0.879 | 0.511 | 0.795 | 0.686 | 0.845 | 0.514 |
| MetricX-24-Hybrid-QE-XXL | 0.884 | 0.525 | 0.789 | 0.685 | 0.863 | 0.527 |
| MetricX-24-Hybrid-QE-XL | 0.879 | 0.502 | 0.774 | 0.683 | 0.849 | 0.509 |
| MetricX-24-Hybrid-QE-Large | 0.809 | 0.490 | 0.762 | 0.684 | 0.847 | 0.508 |

Below, the above correlations are averaged, as was done in the shared task to
determine the final ranking of the submissions:

| Model | Average Correlation |
| -------------------------- | ----- |
| MetricX-24-Hybrid-XXL | 0.716 |
| MetricX-24-Hybrid-XL | 0.714 |
| MetricX-24-Hybrid-Large | 0.705 |
| MetricX-24-Hybrid-QE-XXL | 0.712 |
| MetricX-24-Hybrid-QE-XL | 0.699 |
| MetricX-24-Hybrid-QE-Large | 0.683 |
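
The "Average Correlation" column is the plain arithmetic mean of the six
correlations in the first table, as the rows above confirm. A quick sanity
check, with the values copied from the XXL row:

```python
# Mean of the six per-language-pair correlations for MetricX-24-Hybrid-XXL.
xxl_scores = [0.865, 0.543, 0.785, 0.685, 0.878, 0.541]
average = sum(xxl_scores) / len(xxl_scores)
print(round(average, 3))  # 0.716, matching the second table
```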

NOTE: Since the MetricX-24 models are hybrid models, MetricX-24-Hybrid-\<size\>
and MetricX-24-Hybrid-QE-\<size\> in the tables above correspond to the same
model, evaluated *with* and *without* the references, respectively.


## Citation

If you use MetricX-24 in your research, please cite the following publication:

```bibtex
@inproceedings{juraska-etal-2024-metricx,
    title = "{M}etric{X}-24: The {G}oogle Submission to the {WMT} 2024 Metrics Shared Task",
    author = "Juraska, Juraj and
      Deutsch, Daniel and
      Finkelstein, Mara and
      Freitag, Markus",
    editor = "Haddow, Barry and
      Kocmi, Tom and
      Koehn, Philipp and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.35",
    pages = "492--504",
}
```