metadata

title: DmxPerplexity
emoji: 🌖
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 4.7.1
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - evaluate
  - metric
description: >-
  Perplexity metric implemented by d-Matrix. Perplexity (PPL) is one of the most
  common metrics for evaluating language models. It is defined as the
  exponentiated average negative log-likelihood of a sequence, calculated with
  exponent base `e`. Note that this metric is intended for Causal Language
  Models; the perplexity calculation is only correct if the model uses Cross
  Entropy Loss. For more information, see
  https://huggingface.co/docs/transformers/perplexity

Metric Card for Perplexity

Metric Description

Perplexity metric implemented by d-Matrix. Perplexity (PPL) is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base e. Note that this metric is intended for Causal Language Models; the perplexity calculation is only correct if the model uses Cross Entropy Loss. For more information, see https://huggingface.co/docs/transformers/perplexity
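The definition above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the d-Matrix implementation; the function name `perplexity_from_log_probs` and the sample probabilities are hypothetical.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    """Return (loss, perplexity) from natural-log token probabilities.

    Hypothetical helper for illustration; not part of the metric's API.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood (cross-entropy loss, in nats).
    loss = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is the base-e exponential of that average.
    return loss, math.exp(loss)

# Assumed toy input: a model that assigns probability 0.25 to each of 4 tokens.
loss, ppl = perplexity_from_log_probs([math.log(0.25)] * 4)
print(loss, ppl)
```

With a uniform probability of 0.25 per token, the perplexity comes out at approximately 4, which matches the intuition that the model is choosing among four equally likely options at each step.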

How to Use

At minimum, this metric requires the model and references as inputs.

>>> import evaluate
>>> perplexity = evaluate.load("d-matrix/dmx_perplexity", module_type="metric")
>>> input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
>>> results = perplexity.compute(model='distilgpt2', references=input_texts)
>>> print(results)
{'loss': 4.993086338043213, 'perplexity': 147.390625}

Inputs

model (Union[str, AutoModelForCausalLM]): model used for calculating perplexity, given either as a model name (e.g. 'distilgpt2') or as a loaded model instance.
references (list of str): input texts; each separate text snippet is one list entry.
device (str): device to run on, defaults to 'cuda' when available.
max_length (int): maximum sequence length, defaults to 2048.

Output Values

loss (float): the loss of the model predictions compared to the references.
perplexity (float): measures the uncertainty of a model predicting text. Lower perplexity indicates better model performance.

Output Example(s):

{'loss': 4.993086338043213, 'perplexity': 147.390625}

This metric outputs a dictionary containing the loss and the perplexity score.
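Given the definition of perplexity as the exponentiated average negative log-likelihood, the two values in the output example appear to be related by perplexity = exp(loss). The short check below uses only the numbers from that example; it is an observation about the reported output, not a statement about how the metric is implemented, and the tolerance is an assumption to allow for rounding.

```python
import math

# Values copied from the output example in this card.
results = {"loss": 4.993086338043213, "perplexity": 147.390625}

# Recompute perplexity from the loss using exponent base e.
recomputed = math.exp(results["loss"])
print(recomputed)

# Loose relative tolerance (an assumption) to allow for rounding
# in the reported values.
assert math.isclose(recomputed, results["perplexity"], rel_tol=1e-3)
```

The same check can be applied to any result dictionary returned by `compute` as a quick sanity test.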

Examples

>>> import evaluate
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> perplexity = evaluate.load("d-matrix/dmx_perplexity", module_type="metric")
>>> input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10]
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
>>> results = perplexity.compute(model=model, references=input_texts)
>>> print(list(results.keys()))
['loss', 'perplexity']
>>> print(results['loss']) 
3.9706921577453613
>>> print(results['perplexity']) 
53.021217346191406

Citation(s)

https://huggingface.co/docs/transformers/perplexity