bleurt / README.md
lvwerra's picture
lvwerra HF staff
Update Space (evaluate main: 828c6327)
981697b
|
raw
history blame
4.35 kB
metadata
title: BLEURT
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric

Metric Card for BLEURT

Metric Description

BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model Devlin et al. 2018, employing another pre-training phrase using synthetic data, and finally trained on WMT human annotations.

It is possible to run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better). See the project's README for more information.

Intended Uses

BLEURT is intended to be used for evaluating text produced by language models.

How to Use

This metric takes as input lists of predicted sentences and reference sentences:

>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", type="metric")
>>> results = bleurt.compute(predictions=predictions, references=references)

Inputs

  • predictions (list of strs): List of generated sentences to score.
  • references (list of strs): List of references to compare to.
  • checkpoint (str): BLEURT checkpoint. Will default to BLEURT-tiny if not specified. Other models that can be chosen are: "bleurt-tiny-128", "bleurt-tiny-512", "bleurt-base-128", "bleurt-base-512", "bleurt-large-128", "bleurt-large-512", "BLEURT-20-D3", "BLEURT-20-D6", "BLEURT-20-D12" and "BLEURT-20".

Output Values

  • scores : a list of scores, one per prediction.

Output Example:

{'scores': [1.0295498371124268, 1.0445425510406494]}

BLEURT's output is always a number between 0 and (approximately 1). This value indicates how similar the generated text is to the reference texts, with values closer to 1 representing more similar texts.

Values from Popular Papers

The original BLEURT paper reported that the metric is better correlated with human judgment compared to similar metrics such as BERT and BERTscore.

BLEURT is used to compare models across different asks (e.g. (Table to text generation)[https://paperswithcode.com/sota/table-to-text-generation-on-dart?metric=BLEURT]).

Examples

Example with the default model:

>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", type="metric")
>>> results = bleurt.compute(predictions=predictions, references=references)
>>> print(results)
{'scores': [1.0295498371124268, 1.0445425510406494]}

Example with the "bleurt-base-128" model checkpoint:

>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", type="metric", checkpoint="bleurt-base-128")
>>> results = bleurt.compute(predictions=predictions, references=references)
>>> print(results)
{'scores': [1.0295498371124268, 1.0445425510406494]}

Limitations and Bias

The original BLEURT paper showed that BLEURT correlates well with human judgment, but this depends on the model and language pair selected.

Furthermore, currently BLEURT only supports English-language scoring, given that it leverages models trained on English corpora. It may also reflect, to a certain extent, biases and correlations that were present in the model training data.

Finally, calculating the BLEURT metric involves downloading the BLEURT model that is used to compute the score, which can take a significant amount of time depending on the model chosen. Starting with the default model, bleurt-tiny, and testing out larger models if necessary can be a useful approach if memory or internet speed is an issue.

Citation

@inproceedings{bleurt,
  title={BLEURT: Learning Robust Metrics for Text Generation},
  author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
  booktitle={ACL},
  year={2020},
  url={https://arxiv.org/abs/2004.04696}
}

Further References