|
# lmarena-ai/p2l-1.5b-grk-01112025 |
|
|
|
Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. |
|
To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. |
|
The core idea is to train an LLM that takes a natural-language prompt as input and outputs a vector of coefficients, which are then used to predict human preference votes.
|
The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. |
|
Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. |
|
|
|
**Paper**: [Prompt-to-Leaderboard](https://arxiv.org/abs/2502.14855) |
|
|
|
**Code**: [lmarena/p2l](https://github.com/lmarena/p2l) |
|
|
|
This particular P2L model has a *Grounded Rao-Kupper* regression head, which we define below: |
|
|
|
Let |
|
$$ |
|
Y\in \{\mathsf{A}, \mathsf{B}, \mathsf{tie}, \mathsf{bad}\} |
|
$$ |
|
and, for notational convenience, let
|
$$ |
|
\theta^*(z) = \big(\beta^*(z), \eta^*(z)\big); \quad \beta^*(z) \in \mathbb{R}^M, \ \eta^*(z) \in \mathbb{R}_{\geq 1}
|
$$ |
|
|
|
We further define:
|
$$ |
|
\varphi^*(z)_i := \exp(\beta^*(z)_i) |
|
$$ |
|
|
|
The grounded Rao-Kupper model is then defined as:
|
$$ |
|
g_{\theta^*(z)}(y ; x) = |
|
\begin{cases} |
|
\frac{\varphi^*(z)_A}{\varphi^*(z)_A + \eta^*(z)\varphi^*(z)_B + 1} & y = \mathsf{A} \\ |
|
\frac{\varphi^*(z)_B}{\varphi^*(z)_B + \eta^*(z)\varphi^*(z)_A + 1} & y = \mathsf{B}\\ |
|
\frac{1}{1 + \varphi^*(z)_A + \varphi^*(z)_B} & y = \mathsf{bad}\\ |
|
1 - \frac{\varphi^*(z)_A}{\varphi^*(z)_A + \eta^*(z)\varphi^*(z)_B + 1} - \frac{\varphi^*(z)_B}{\varphi^*(z)_B + \eta^*(z)\varphi^*(z)_A + 1} - \frac{1}{1 + \varphi^*(z)_A + \varphi^*(z)_B} & y = \mathsf{tie}. |
|
\end{cases} |
|
$$ |
|
|
|
See Section 2.2 of our paper for more details on the various regression heads.
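
As a concrete illustration (with made-up values for $\beta$ and $\eta$, not outputs of this checkpoint), the four probabilities above can be computed directly:

```python
import math

# Hypothetical prompt-specific coefficients for two models A and B.
beta_a, beta_b = 0.8, 0.2
eta = 1.5  # tie coefficient; constrained to be >= 1

phi_a, phi_b = math.exp(beta_a), math.exp(beta_b)

p_a = phi_a / (phi_a + eta * phi_b + 1)
p_b = phi_b / (phi_b + eta * phi_a + 1)
p_bad = 1.0 / (1.0 + phi_a + phi_b)
p_tie = 1.0 - p_a - p_b - p_bad  # remaining probability mass

print(f"P(A)={p_a:.3f}  P(B)={p_b:.3f}  P(tie)={p_tie:.3f}  P(bad)={p_bad:.3f}")
assert p_tie >= 0  # holds because eta >= 1
```

Note that with $\eta = 1$ the tie probability collapses to zero, which is why $\eta$ is constrained to lie in $\mathbb{R}_{\geq 1}$.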
|
|
|
## Serving |
|
To serve a P2L model, please see our documentation on GitHub: [Serving P2L](https://github.com/lmarena/p2l?tab=readme-ov-file#serving-p2l). |
|
|
|
Note: the P2L model returns outputs with the following structure:
|
|
|
|
|
```python |
|
class P2LOutputs(ModelOutput): |
|
coefs: torch.FloatTensor = None # "betas" as described above |
|
    eta: Optional[torch.FloatTensor] = None # tie coefficient (also eta above)
|
last_hidden_state: torch.FloatTensor = None # last hidden state from the transformer |
|
``` |
|
|
|
To understand which coefficient index corresponds to which model, see the [`model_list.json`](./model_list.json) found in the repo of each P2L model. The models are always listed in sorted order.
|
|
|
The easiest way to fetch this list programmatically is:
|
|
|
```python |
|
import json |
|
from huggingface_hub import hf_hub_download |
|
|
|
fname = hf_hub_download( |
|
repo_id="lmarena-ai/p2l-1.5b-grk-01112025", filename="model_list.json", repo_type="model" |
|
) |
|
|
|
with open(fname) as fin: |
|
model_list = json.load(fin) |
|
``` |
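
As a small usage sketch (assuming `model_list` was loaded as above), the coefficient index for each model can then be looked up by name:

```python
# Map each model name to its position in the coefficient vector.
model_to_idx = {name: i for i, name in enumerate(model_list)}

# outputs.coefs[..., model_to_idx[name]] is then the coefficient for `name`.
```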
|
|
|
|
|
|
|
### Loading from Pretrained |
|
|
|
To define and load the model: |
|
|
|
```python |
|
|
|
import json
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, Qwen2Model, Qwen2PreTrainedModel
from transformers.utils import ModelOutput
|
|
|
|
|
@dataclass |
|
class HeadOutputs(ModelOutput): |
|
coefs: torch.FloatTensor = None |
|
eta: Optional[torch.FloatTensor] = None |
|
gamma: Optional[torch.FloatTensor] = None |
|
|
|
|
|
@dataclass |
|
class P2LOutputs(ModelOutput): |
|
coefs: torch.FloatTensor = None |
|
eta: Optional[torch.FloatTensor] = None |
|
gamma: Optional[torch.FloatTensor] = None |
|
loss: Optional[torch.FloatTensor] = None |
|
last_hidden_state: torch.FloatTensor = None |
|
|
|
class RKHead(nn.Module): |
|
def __init__( |
|
self, |
|
input_dim, |
|
output_dim, |
|
**kwargs, |
|
) -> None: |
|
super().__init__() |
|
self.head = nn.Linear( |
|
in_features=input_dim, out_features=output_dim, bias=True |
|
) |
|
self.eta_head = nn.Linear( |
|
in_features=input_dim, out_features=1, bias=True |
|
) |
|
|
|
def forward(self, last_hidden_dim: torch.Tensor): |
|
coefs = self.head(last_hidden_dim) |
|
eta = self.eta_head(last_hidden_dim) |
|
|
|
return HeadOutputs(coefs=coefs, eta=eta) |
|
|
|
class P2LModel(Qwen2PreTrainedModel): |
|
def __init__( |
|
self, |
|
config, |
|
CLS_id, |
|
num_models, |
|
head_kwargs={}, |
|
**kwargs, |
|
): |
|
super().__init__(config) |
|
|
|
self.num_models = num_models |
|
self.cls_token_id = CLS_id |
|
|
|
self.model = Qwen2Model(config) |
|
|
|
self.head = RKHead( |
|
input_dim=config.hidden_size, |
|
output_dim=self.num_models, |
|
**head_kwargs, |
|
) |
|
|
|
self.post_init() |
|
|
|
def freeze_transformer(self): |
|
for param in self.model.parameters(): |
|
param.requires_grad = False |
|
|
|
def get_input_embeddings(self): |
|
return self.model.embed_tokens |
|
|
|
def set_input_embeddings(self, value): |
|
self.model.embed_tokens = value |
|
|
|
def forward(self, input_ids, attention_mask, labels=None, weights=None): |
|
batch_size = input_ids.shape[0] |
|
|
|
hidden_outputs = self.model( |
|
input_ids=input_ids, |
|
attention_mask=attention_mask, |
|
output_hidden_states=False, |
|
).last_hidden_state # (bs, num_token, embed_dim) |
|
|
|
cls_mask = input_ids == self.cls_token_id |
|
|
|
        # select the hidden state at each sequence's CLS token
|
cls_hidden_dim = hidden_outputs[cls_mask] |
|
|
|
assert ( |
|
cls_hidden_dim.shape[0] == batch_size |
|
), f"input ids {input_ids.shape}, cls_mask {cls_mask.shape}, cls_logit {cls_hidden_dim.shape}" |
|
|
|
head_output = self.head(cls_hidden_dim) |
|
|
|
|
|
outputs = P2LOutputs( |
|
coefs=head_output.coefs, |
|
last_hidden_state=cls_hidden_dim, |
|
eta=head_output.eta, |
|
gamma=head_output.gamma, |
|
) |
|
|
|
return outputs |
|
|
|
|
|
fname = hf_hub_download( |
|
repo_id="lmarena-ai/p2l-1.5b-grk-01112025", filename="model_list.json", repo_type="model" |
|
) |
|
|
|
with open(fname) as fin: |
|
model_list = json.load(fin) |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("lmarena-ai/p2l-1.5b-grk-01112025") |
|
model = P2LModel.from_pretrained( |
|
"lmarena-ai/p2l-1.5b-grk-01112025", |
|
CLS_id=tokenizer.cls_token_id, |
|
num_models=len(model_list), |
|
torch_dtype=torch.bfloat16, |
|
) |
|
|
|
``` |
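
Once loaded, running inference looks roughly like the sketch below. This is illustrative only: the exact prompt formatting used during training is documented in the GitHub repo; here we simply append the tokenizer's `cls_token`, since the forward pass above pools the hidden state at that token.

```python
# Minimal inference sketch (illustrative; see the GitHub repo for the
# exact prompt format used in training).
prompt = "Write a Python function that merges two sorted lists."

# Assumption: the CLS token is appended after the prompt.
inputs = tokenizer(prompt + tokenizer.cls_token, return_tensors="pt")

with torch.no_grad():
    out = model(inputs["input_ids"], inputs["attention_mask"])

coefs = out.coefs.squeeze(0).float()  # (num_models,), the prompt-specific "betas"

# Higher coefficient => stronger predicted performance on this prompt.
top = torch.topk(coefs, k=5)
for rank, (i, score) in enumerate(zip(top.indices.tolist(), top.values.tolist()), 1):
    print(f"{rank}. {model_list[i]}: {score:.3f}")
```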
|
|
|
## Citation |
|
|
|
``` |
|
@misc{frick2025prompttoleaderboard, |
|
title={Prompt-to-Leaderboard}, |
|
author={Evan Frick and Connor Chen and Joseph Tennyson and Tianle Li and Wei-Lin Chiang and Anastasios N. Angelopoulos and Ion Stoica}, |
|
year={2025}, |
|
eprint={2502.14855}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2502.14855}, |
|
} |
|
``` |