
InstruSumEval / src /about.py


	# ---------------------------------------------------

	# Your leaderboard name
	TITLE = """<h1 align="center" id="space-title">InstruSumEval Leaderboard</h1>"""

	# What does your leaderboard evaluate?
	INTRODUCTION_TEXT = """
	- This leaderboard measures how well language models perform as evaluators on the [Salesforce/InstruSum](https://huggingface.co/datasets/Salesforce/InstruSum) benchmark from our paper ["Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization"](https://arxiv.org/abs/2311.09184).
	- InstruSum is a benchmark for instruction-controllable summarization, where the goal is to generate summaries that satisfy user-provided instructions.
	- The benchmark includes human evaluations of the generated summaries. Models are scored against these annotations as judges of long-context instruction following.

	### Metrics
	- Accuracy: The percentage of comparisons in which the model agrees with the human evaluator.
	- Agreement: Cohen's kappa between the model and the human evaluator.
	- Self-Accuracy: The percentage of comparisons in which the model agrees with itself when the order of the two inputs is swapped.
	- Self-Agreement: Cohen's kappa between the model's judgments before and after the two inputs are swapped.
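
As a rough illustration of how these four metrics relate, the sketch below computes them on toy labels. The function names, the label encoding (1 or 2 for the preferred summary), the handling of the degenerate constant-label case, and the example data are all assumptions made for illustration. This is not the leaderboard's actual scoring code.

```python
from collections import Counter

def accuracy(a, b):
    # Fraction of positions where the two label sequences match.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    # Cohen's kappa: observed agreement corrected for chance agreement.
    n = len(a)
    p_o = accuracy(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two marginal label distributions.
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    # Assumed convention: report 1.0 when both raters use one identical label.
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Toy data (hypothetical): 1 or 2 marks the preferred summary of each pair.
human = [1, 2, 1, 1, 2, 2]
judge_forward = [1, 2, 2, 1, 2, 1]   # judge verdicts in the original order
judge_swapped = [1, 2, 1, 1, 2, 1]   # verdicts after swapping, mapped back

print("Accuracy:", accuracy(judge_forward, human))
print("Agreement:", cohen_kappa(judge_forward, human))
print("Self-Accuracy:", accuracy(judge_forward, judge_swapped))
print("Self-Agreement:", cohen_kappa(judge_forward, judge_swapped))
```

Kappa is lower than raw accuracy here because half of the matches would be expected by chance when both raters pick each label equally often.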
	"""

	# Which evaluations are you running? how can people reproduce what you have?
	LLM_BENCHMARKS_TEXT = """
	## How it works

	### Task
	The LLMs are evaluated as judges in a pairwise comparison task.
	Each judge is presented with two instruction-controllable summaries and asked to select the better one.
	The model's accuracy and agreement with the human evaluator are then calculated.

	### Dataset
	The human annotations are from the [InstruSum](https://huggingface.co/datasets/Salesforce/InstruSum) dataset.
	Its pairwise annotation [subset](https://huggingface.co/datasets/Salesforce/InstruSum/viewer/human_eval_pairwise) is used for evaluation.

	This subset contains pairwise human judgments derived from the ranking-based results in the [`human_eval`](https://huggingface.co/datasets/Salesforce/InstruSum/viewer/human_eval) subset.

	The conversion process is as follows:
	- The ranking-based human evaluation results are converted into pairwise comparisons for the overall quality aspect.
	- Only comparisons where the annotators reached a consensus are included.
	- Comparisons that resulted in a tie are excluded.
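
The conversion steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to build the subset: the function name, the data layout (one list of ranks per annotator, lower is better), and the reading of "consensus" as unanimity among annotators are all hypothetical.

```python
from itertools import combinations

def rankings_to_pairs(systems, rankings):
    # systems: list of system names.
    # rankings: one list of ranks per annotator, aligned with systems
    # (lower rank = better, equal ranks = tie).
    pairs = []
    for i, j in combinations(range(len(systems)), 2):
        verdicts = []
        for ranks in rankings:
            if ranks[i] < ranks[j]:
                verdicts.append(systems[i])
            elif ranks[i] > ranks[j]:
                verdicts.append(systems[j])
            else:
                verdicts.append("tie")
        # Keep the pair only if every annotator gave the same verdict
        # (assumed meaning of consensus) and that verdict is not a tie.
        if len(set(verdicts)) == 1 and verdicts[0] != "tie":
            pairs.append((systems[i], systems[j], verdicts[0]))
    return pairs

# Toy rankings from three hypothetical annotators.
systems = ["sys_a", "sys_b", "sys_c"]
rankings = [
    [1, 2, 2],
    [1, 3, 2],
    [1, 2, 2],
]
pairs = rankings_to_pairs(systems, rankings)
print(pairs)
```

In this toy input, the annotators disagree on `sys_b` versus `sys_c` (two ties and one preference), so that pair is dropped and only the two comparisons won by `sys_a` remain.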

	### Evaluation Details
	- Instruction-controllable summarization is treated as a long-context instruction-following task.
	Therefore, the source article and the instruction are combined to form a single instruction for the model to follow.

	- The LLMs are evaluated on the pairwise comparison task.
	The [prompt](https://github.com/princeton-nlp/LLMBar/blob/main/LLMEvaluator/evaluators/prompts/comparison/Vanilla.txt) from [LLMBar](https://github.com/princeton-nlp/LLMBar) is adopted for the evaluation.

	- The pairwise comparison is conducted bidirectionally: each pair is judged twice, with the order of the two summaries swapped, to measure self-agreement.
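
A minimal sketch of the bidirectional comparison is shown below. The `judge` callable, its signature, and its return values (1 or 2 for the preferred output) are hypothetical stand-ins for an LLM call using the prompt above. The toy judge exists only to make the example runnable.

```python
def judge_both_orders(judge, instruction, summary_1, summary_2):
    # Query the judge once in the original order and once with outputs swapped.
    forward = judge(instruction, summary_1, summary_2)
    swapped = judge(instruction, summary_2, summary_1)
    # Map the swapped verdict back to the original labelling (1 <-> 2).
    swapped_mapped = 3 - swapped
    return forward, swapped_mapped

def length_judge(instruction, output_1, output_2):
    # Toy stand-in for an LLM judge: prefers the shorter output,
    # and falls back to the first position on ties (a position bias).
    return 1 if len(output_1) <= len(output_2) else 2

consistent = judge_both_orders(
    length_judge, "Summarize the article.", "short", "a much longer summary"
)
inconsistent = judge_both_orders(length_judge, "Summarize the article.", "aa", "bb")
print(consistent, inconsistent)
```

When the two verdicts in a returned pair differ, the judge changed its mind after the swap, which is the kind of position sensitivity that the self-agreement metrics are meant to expose.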
	"""

	CITATION_BUTTON_LABEL = "Please cite our paper if you use InstruSum in your work."
	CITATION_BUTTON_TEXT = r"""@article{liu2023benchmarking,
	title={Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization},
	author={Liu, Yixin and Fabbri, Alexander R and Chen, Jiawen and Zhao, Yilun and Han, Simeng and Joty, Shafiq and Liu, Pengfei and Radev, Dragomir and Wu, Chien-Sheng and Cohan, Arman},
	journal={arXiv preprint arXiv:2311.09184},
	year={2023}
	}"""