metadata
title: DmxPerplexity
emoji: π
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 4.7.1
app_file: app.py
pinned: false
license: apache-2.0
tags:
- evaluate
- metric
description: >-
Perplexity metric implemented by d-Matrix. Perplexity (PPL) is one of the most
common metrics for evaluating language models. It is defined as the
exponentiated average negative log-likelihood of a sequence, calculated with
exponent base `e`. Note that this metric is intended for causal language
models; the perplexity calculation is only correct if the model uses
cross-entropy loss. For more information, see
https://huggingface.co/docs/transformers/perplexity
Metric Card for Perplexity
Metric Description
Perplexity metric implemented by d-Matrix.
Perplexity (PPL) is one of the most common metrics for evaluating language models.
It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base e.
Note that this metric is intended for causal language models; the perplexity calculation is only correct if the model uses cross-entropy loss.
For more information, see https://huggingface.co/docs/transformers/perplexity
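The definition above can be illustrated with a minimal sketch. It is not the d-Matrix implementation; the function name `perplexity_from_log_probs` and the sample probabilities are hypothetical, and the input is assumed to be the natural-log probability the model assigned to each true token.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    """Return (loss, perplexity) from natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token (the cross-entropy loss).
    loss = -sum(token_log_probs) / len(token_log_probs)
    # Exponentiate with base e to obtain the perplexity.
    return loss, math.exp(loss)

# Hypothetical probabilities a model assigned to the four true tokens.
token_probs = [0.25, 0.25, 0.25, 0.25]
loss, ppl = perplexity_from_log_probs([math.log(p) for p in token_probs])
print(loss, ppl)
```

In this toy case the model is as uncertain as a uniform choice among four tokens, so the perplexity comes out at approximately 4.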
How to Use
At minimum, this metric requires the model and references as inputs.
>>> import evaluate
>>> perplexity = evaluate.load("d-matrix/dmx_perplexity", module_type="metric")
>>> input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
>>> results = perplexity.compute(model='distilgpt2', references=input_texts)
>>> print(results)
{'loss': 4.993086338043213, 'perplexity': 147.390625}
Inputs
- model (Union[str, AutoModelForCausalLM]): model used for calculating perplexity.
- references (list of str): input text; each separate text snippet is one list entry.
- device (str): device to run on; defaults to 'cuda' when available.
- max_length (int): maximum sequence length; defaults to 2048.
Output Values
- loss (float): the loss of the model predictions compared to the reference.
- perplexity (float): measures the uncertainty of a model predicting text. Model performance is better when perplexity is lower.
Output Example(s):
{'loss': 4.993086338043213, 'perplexity': 147.390625}
This metric outputs a dictionary containing the loss and the perplexity score.
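As a quick sanity check, the two output values can be compared against each other. The sketch below assumes that `loss` is the average negative log-likelihood in natural-log units, which matches the metric description but is not stated explicitly for this field; under that assumption, `perplexity` should equal `exp(loss)` up to floating-point rounding.

```python
import math

# Values copied from the output example above.
results = {"loss": 4.993086338043213, "perplexity": 147.390625}

# With a base-e cross-entropy loss, perplexity should equal exp(loss),
# up to the floating-point precision used during evaluation.
recomputed = math.exp(results["loss"])
consistent = math.isclose(recomputed, results["perplexity"], rel_tol=1e-3)
print(consistent)
```

The tolerance is deliberately loose because the reported perplexity may have been computed at lower precision than the loss.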
Examples
>>> import evaluate
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> perplexity = evaluate.load("d-matrix/dmx_perplexity", module_type="metric")
>>> input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10]
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
>>> results = perplexity.compute(model=model, references=input_texts)
>>> print(list(results.keys()))
['loss', 'perplexity']
>>> print(results['loss'])
3.9706921577453613
>>> print(results['perplexity'])
53.021217346191406