Lighteval documentation

Adding a New Metric

First, check if you can use one of the parametrized functions in Corpus Metrics or Sample Metrics.

If not, you can use the custom_task system to register your new metric, following the steps below.

To see an example of a custom metric added along with a custom task, look at the IFEval custom task.

To contribute your custom metric to the lighteval repo, first install the required dev dependencies by running pip install -e .[dev], then run pre-commit install to set up the pre-commit hooks.

  • Create a new Python file that contains the full logic of your metric.
  • Start the file with these imports:
from aenum import extend_enum
from lighteval.metrics import Metrics

Next, define a sample-level metric function:

def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]

Here the sample-level function returns a single value. To return multiple metrics per sample, return a dictionary that maps each metric name to its value.

def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
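The two return shapes can be tried out without installing lighteval. The following is a minimal sketch, assuming a stand-in Doc dataclass that carries only the choices and gold_index fields used above; the stand-in, the function names, and the response_length submetric are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for lighteval's Doc, with only the two fields used here."""

    choices: list[str]
    gold_index: int


def exact_match(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    # Single-value variant: compare the first prediction with the gold choice.
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]


def exact_match_and_length(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    # Multi-value variant: one dictionary entry per submetric name.
    response = predictions[0]
    gold = formatted_doc.choices[formatted_doc.gold_index]
    return {"accuracy": response == gold, "response_length": len(response)}


doc = Doc(choices=["Paris", "Rome"], gold_index=0)
single = exact_match(["Paris"], doc)
multi = exact_match_and_length(["Rome"], doc)
```

The keys of the returned dictionary are the names you would later use for the submetrics, so it helps to keep them stable and descriptive.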

Then, if needed, define an aggregation function that combines the sample-level values into a corpus-level score. A common choice is np.mean.

def agg_function(items):
    # Assumes each item is itself a list of scores: flatten one level, then average.
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
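The aggregation function above flattens one level of nesting, so it only fits sample-level functions that return a list of scores per sample. The sketch below illustrates this with made-up inputs; it assumes the aggregation function receives one sample-level value per sample, and the variable names are hypothetical.

```python
def agg_function(items):
    # Same logic as above: flatten one level of nesting, then average.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


nested_scores = [[1, 0], [1], [0, 0, 1]]  # one list of scores per sample
flat_scores = [True, False, True, True]   # one score per sample

# Averages over all six inner scores.
nested_mean = agg_function(nested_scores)

# With one score per sample, a plain mean is enough (booleans count as 1 and 0).
flat_mean = sum(flat_scores) / len(flat_scores)

# Flattening fails on flat input because a single score is not iterable.
try:
    agg_function(flat_scores)
    flatten_failed = False
except TypeError:
    flatten_failed = True
```

If your sample-level function returns a single bool or float, as in the examples on this page, a plain mean such as np.mean is the simpler fit.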

Finally, define your metric. For a sample-level metric that returns a single value, use SampleLevelMetric. The values in braces are placeholders to replace with your own:

my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)

If your sample-level function returns multiple metrics per sample, use SampleLevelMetricGrouping instead:

# Named differently from the custom_metric function so the function is not shadowed.
my_custom_metrics = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
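To make the role of the corpus_level_fn dictionary concrete, here is a rough simulation in plain Python of applying one aggregation function per submetric. It is a sketch under stated assumptions, not lighteval's actual implementation: aggregate_grouped and mean are hypothetical helpers, and the sample results are invented.

```python
def mean(values):
    # Plain average; booleans count as 1 and 0.
    return sum(values) / len(values)


def aggregate_grouped(sample_results, corpus_level_fns):
    """Hypothetical helper: apply one aggregation function per submetric name."""
    return {
        name: fn([result[name] for result in sample_results])
        for name, fn in corpus_level_fns.items()
    }


# One dictionary per sample, as returned by a multi-metric sample-level function.
sample_results = [
    {"accuracy": True, "other_metric": 0.5},
    {"accuracy": False, "other_metric": 0.5},
    {"accuracy": True, "other_metric": 1.0},
    {"accuracy": True, "other_metric": 0.0},
]

corpus_scores = aggregate_grouped(
    sample_results,
    {"accuracy": mean, "other_metric": max},
)
```

Each key in the aggregation dictionary has to match a key returned by the sample-level function; otherwise there is nothing to aggregate for that submetric.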

To finish, add the following at the end of the file so that your metric is added to the Metrics list when the file is loaded as a module.

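# Adds the metric to the metric list!
# Replace "metric_name" and metric_function with your metric's name and the metric object defined above.
extend_enum(Metrics, "metric_name", metric_function)

if __name__ == "__main__":
    print("Imported metric")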

You can then pass your custom metric to lighteval with --custom-tasks path_to_your_file when launching it.
