metadata

license: apache-2.0

MetricX-24 (XL, bfloat16)

This is not an officially supported Google product.

ℹ️ For the full-precision (float32) variant of this model, see MetricX-24 (XL).

GitHub repository: https://github.com/google-research/metricx

The repository contains the code for running inference on MetricX-24 models, a family of models for automatic evaluation of translations that were proposed in the WMT'24 Metrics Shared Task submission MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task. The models were trained in T5X and then converted for use in PyTorch.

Available Models

There are 3 MetricX-24 models available on Hugging Face that vary in the number of parameters. Unlike the MetricX-23 models, the MetricX-24 models are all hybrid models that can do both reference-based and reference-free (also known as quality estimation, or QE) inference:

We recommend using the XXL model versions for the best agreement with human judgments of translation quality, the Large versions for best speed, and the XL for an intermediate use case.

Changes to the WMT'24 Submission

The MetricX-24 models available here are most similar to the primary submission to the WMT'24 Metrics Shared Task. They are initialized with mT5, then fine-tuned on a combination of direct assessment and MQM data from WMT'15-'22. However, we made a couple of small changes that make these models different from the WMT'24 submissions.

First, the metric scores get automatically clipped at 0 and 25, to ensure they are strictly in the [0, 25] range, as due to the nature of regression models, the scores could otherwise sometimes fall outside the range.

Second, we included one additional type of synthetic training examples that weren't ready in time for the official submission. These are examples of perfect translations of multi-sentence segments, generated from the MQM data from WMT'20-'22. The purpose of this category of synthetic data is to reduce the model's bias against longer translations when the source segment and/or reference are also long.

Model Performance

For comparison with the submissions to WMT'24 Metrics Shared Task, we provide an overview of the system- and segment-level correlation scores between the MetricX-24 scores and MQM ratings of translation quality, as calculated on the shared task's test sets:

Model	Sys-Level SPA (en-de)	Seg-Level Acc (en-de)	Sys-Level SPA (en-es)	Seg-Level Acc (en-es)	Sys-Level SPA (ja-zh)	Seg-Level Acc (ja-zh)
MetricX-24-Hybrid-XXL	0.865	0.543	0.785	0.685	0.878	0.541
MetricX-24-Hybrid-XL	0.884	0.522	0.806	0.683	0.859	0.528
MetricX-24-Hybrid-Large	0.879	0.511	0.795	0.686	0.845	0.514
MetricX-24-Hybrid-QE-XXL	0.884	0.525	0.789	0.685	0.863	0.527
MetricX-24-Hybrid-QE-XL	0.879	0.502	0.774	0.683	0.849	0.509
MetricX-24-Hybrid-QE-Large	0.809	0.490	0.762	0.684	0.847	0.508

Below are the above correlation scores averaged, as used in the shared task to determine the final ranking of the submissions:

Model	Average Correlation
MetricX-24-Hybrid-XXL	0.716
MetricX-24-Hybrid-XL	0.714
MetricX-24-Hybrid-Large	0.705
MetricX-24-Hybrid-QE-XXL	0.712
MetricX-24-Hybrid-QE-XL	0.699
MetricX-24-Hybrid-QE-Large	0.683

NOTE: Since MetricX-24 models are hybrid models, MetricX-24-<size> and MetricX-24-QE-<size> correspond to the same model, evaluated with and without the references, respectively.

Citation

If you use MetricX-24 in your research, please cite the following publication:

@inproceedings{juraska-etal-2024-metricx,
    title = "{M}etric{X}-24: The {G}oogle Submission to the {WMT} 2024 Metrics Shared Task",
    author = "Juraska, Juraj  and
      Deutsch, Daniel  and
      Finkelstein, Mara  and
      Freitag, Markus",
    editor = "Haddow, Barry  and
      Kocmi, Tom  and
      Koehn, Philipp  and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.35",
    pages = "492--504",
}