|
# lmarena-ai/p2l-1.5b-grk-01112025 |
|
|
|
Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. |
|
To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. |
|
The core idea is to train an LLM that takes a natural-language prompt as input and outputs a vector of coefficients, which are then used to predict human preference votes.
|
The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. |
|
Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. |
|
|
|
**Paper**: [Prompt-to-Leaderboard](https://arxiv.org/abs/2502.14855) |
|
|
|
**Code**: [lmarena/p2l](https://github.com/lmarena/p2l) |
|
|
|
This particular P2L model has a *Grounded Rao-Kupper* regression head, which we define below: |
|
|
|
Let |
|
$$ |
|
Y\in \{\mathsf{A}, \mathsf{B}, \mathsf{tie}, \mathsf{bad}\} |
|
$$ |
|
and, for notational convenience, let
|
$$ |
|
\theta^*(z) = \big(\beta^*(z), \eta^*(z)\big); \quad \beta^*(z) \in \mathbb{R}^M, \ \eta^*(z) \in \mathbb{R}_{\geq 1}
|
$$ |
|
|
|
We further define:
|
$$ |
|
\varphi^*(z)_i := \exp(\beta^*(z)_i) |
|
$$ |
|
|
|
The grounded Rao-Kupper model is then defined as:
|
$$ |
|
g_{\theta^*(z)}(y ; x) = |
|
\begin{cases} |
|
\frac{\varphi^*(z)_A}{\varphi^*(z)_A + \eta^*(z)\varphi^*(z)_B + 1} & y = \mathsf{A} \\ |
|
\frac{\varphi^*(z)_B}{\varphi^*(z)_B + \eta^*(z)\varphi^*(z)_A + 1} & y = \mathsf{B}\\ |
|
\frac{1}{1 + \varphi^*(z)_A + \varphi^*(z)_B} & y = \mathsf{bad}\\ |
|
1 - \frac{\varphi^*(z)_A}{\varphi^*(z)_A + \eta^*(z)\varphi^*(z)_B + 1} - \frac{\varphi^*(z)_B}{\varphi^*(z)_B + \eta^*(z)\varphi^*(z)_A + 1} - \frac{1}{1 + \varphi^*(z)_A + \varphi^*(z)_B} & y = \mathsf{tie}. |
|
\end{cases} |
|
$$ |
|
|
|
See Section 2.2 of our paper for more details on the various regression heads.
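
As a concrete illustration (with made-up values for $\beta$ and $\eta$, not outputs of this checkpoint), the four probabilities above can be computed directly:

```python
import math

# Hypothetical prompt-specific coefficients for two models A and B.
beta_a, beta_b = 0.8, 0.2
eta = 1.5  # tie coefficient; constrained to be >= 1

phi_a, phi_b = math.exp(beta_a), math.exp(beta_b)

p_a = phi_a / (phi_a + eta * phi_b + 1)
p_b = phi_b / (phi_b + eta * phi_a + 1)
p_bad = 1.0 / (1.0 + phi_a + phi_b)
p_tie = 1.0 - p_a - p_b - p_bad  # remaining probability mass

print(f"P(A)={p_a:.3f}  P(B)={p_b:.3f}  P(tie)={p_tie:.3f}  P(bad)={p_bad:.3f}")
assert p_tie >= 0  # holds because eta >= 1
```

Note that with $\eta = 1$ the tie probability collapses to zero, which is why $\eta$ is constrained to lie in $\mathbb{R}_{\geq 1}$.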
|
|
|
## Serving |
|
To serve a P2L model, please see our documentation on GitHub: [Serving P2L](https://github.com/lmarena/p2l?tab=readme-ov-file#serving-p2l). |
|
|
|
Note: the P2L model returns outputs with the following structure:
|
|
|
|
|
```python |
|
class P2LOutputs(ModelOutput): |
|
coefs: torch.FloatTensor = None # "betas" as described above |
|
    eta: Optional[torch.FloatTensor] = None # tie coefficient (also eta above)
|
last_hidden_state: torch.FloatTensor = None # last hidden state from the transformer |
|
``` |
|
|
|
To understand which coefficient index corresponds to which model, see the [`model_list.json`](./model_list.json) found in the repo of each P2L model. The models are always listed in sorted order.
|
|
|
The easiest way to fetch this list programmatically is:
|
|
|
```python |
|
import json |
|
from huggingface_hub import hf_hub_download |
|
|
|
fname = hf_hub_download( |
|
repo_id="lmarena-ai/p2l-1.5b-grk-01112025", filename="model_list.json", repo_type="model" |
|
) |
|
|
|
with open(fname) as fin: |
|
model_list = json.load(fin) |
|
``` |
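
As a small usage sketch (assuming `model_list` was loaded as above), the coefficient index for each model can then be looked up by name:

```python
# Map each model name to its position in the coefficient vector.
model_to_idx = {name: i for i, name in enumerate(model_list)}

# outputs.coefs[..., model_to_idx[name]] is then the coefficient for `name`.
```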
|
|
|
|
|
|
|
### Loading from Pretrained |
|
|
|
To define and load the model: |
|
|
|
```python |
|
|
|
import json
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, Qwen2Model, Qwen2PreTrainedModel
from transformers.utils import ModelOutput
|
|
|
|
|
@dataclass |
|
class HeadOutputs(ModelOutput): |
|
coefs: torch.FloatTensor = None |
|
eta: Optional[torch.FloatTensor] = None |
|
gamma: Optional[torch.FloatTensor] = None |
|
|
|
|
|
@dataclass |
|
class P2LOutputs(ModelOutput): |
|
coefs: torch.FloatTensor = None |
|
eta: Optional[torch.FloatTensor] = None |
|
gamma: Optional[torch.FloatTensor] = None |
|
loss: Optional[torch.FloatTensor] = None |
|
last_hidden_state: torch.FloatTensor = None |
|
|
|
class RKHead(nn.Module): |
|
def __init__( |
|
self, |
|
input_dim, |
|
output_dim, |
|
**kwargs, |
|
) -> None: |
|
super().__init__() |
|
self.head = nn.Linear( |
|
in_features=input_dim, out_features=output_dim, bias=True |
|
) |
|
self.eta_head = nn.Linear( |
|
in_features=input_dim, out_features=1, bias=True |
|
) |
|
|
|
def forward(self, last_hidden_dim: torch.Tensor): |
|
coefs = self.head(last_hidden_dim) |
|
eta = self.eta_head(last_hidden_dim) |
|
|
|
return HeadOutputs(coefs=coefs, eta=eta) |
|
|
|
class P2LModel(Qwen2PreTrainedModel): |
|
def __init__( |
|
self, |
|
config, |
|
CLS_id, |
|
num_models, |
|
head_kwargs={}, |
|
**kwargs, |
|
): |
|
super().__init__(config) |
|
|
|
self.num_models = num_models |
|
self.cls_token_id = CLS_id |
|
|
|
self.model = Qwen2Model(config) |
|
|
|
self.head = RKHead( |
|
input_dim=config.hidden_size, |
|
output_dim=self.num_models, |
|
**head_kwargs, |
|
) |
|
|
|
self.post_init() |
|
|
|
def freeze_transformer(self): |
|
for param in self.model.parameters(): |
|
param.requires_grad = False |
|
|
|
def get_input_embeddings(self): |
|
return self.model.embed_tokens |
|
|
|
def set_input_embeddings(self, value): |
|
self.model.embed_tokens = value |
|
|
|
def forward(self, input_ids, attention_mask, labels=None, weights=None): |
|
batch_size = input_ids.shape[0] |
|
|
|
hidden_outputs = self.model( |
|
input_ids=input_ids, |
|
attention_mask=attention_mask, |
|
output_hidden_states=False, |
|
).last_hidden_state # (bs, num_token, embed_dim) |
|
|
|
cls_mask = input_ids == self.cls_token_id |
|
|
|
        # select the hidden state at each sequence's CLS token
|
cls_hidden_dim = hidden_outputs[cls_mask] |
|
|
|
assert ( |
|
cls_hidden_dim.shape[0] == batch_size |
|
), f"input ids {input_ids.shape}, cls_mask {cls_mask.shape}, cls_logit {cls_hidden_dim.shape}" |
|
|
|
head_output = self.head(cls_hidden_dim) |
|
|
|
|
|
outputs = P2LOutputs( |
|
coefs=head_output.coefs, |
|
last_hidden_state=cls_hidden_dim, |
|
eta=head_output.eta, |
|
gamma=head_output.gamma, |
|
) |
|
|
|
return outputs |
|
|
|
|
|
fname = hf_hub_download( |
|
repo_id="lmarena-ai/p2l-1.5b-grk-01112025", filename="model_list.json", repo_type="model" |
|
) |
|
|
|
with open(fname) as fin: |
|
model_list = json.load(fin) |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("lmarena-ai/p2l-1.5b-grk-01112025") |
|
model = P2LModel.from_pretrained( |
|
"lmarena-ai/p2l-1.5b-grk-01112025", |
|
CLS_id=tokenizer.cls_token_id, |
|
num_models=len(model_list), |
|
torch_dtype=torch.bfloat16, |
|
) |
|
|
|
``` |
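
Once loaded, running inference looks roughly like the sketch below. This is illustrative only: the exact prompt formatting used during training is documented in the GitHub repo; here we simply append the tokenizer's `cls_token`, since the forward pass above pools the hidden state at that token.

```python
# Minimal inference sketch (illustrative; see the GitHub repo for the
# exact prompt format used in training).
prompt = "Write a Python function that merges two sorted lists."

# Assumption: the CLS token is appended after the prompt.
inputs = tokenizer(prompt + tokenizer.cls_token, return_tensors="pt")

with torch.no_grad():
    out = model(inputs["input_ids"], inputs["attention_mask"])

coefs = out.coefs.squeeze(0).float()  # (num_models,), the prompt-specific "betas"

# Higher coefficient => stronger predicted performance on this prompt.
top = torch.topk(coefs, k=5)
for rank, (i, score) in enumerate(zip(top.indices.tolist(), top.values.tolist()), 1):
    print(f"{rank}. {model_list[i]}: {score:.3f}")
```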
|
|
|
## Citation |
|
|
|
``` |
|
@misc{frick2025prompttoleaderboard, |
|
title={Prompt-to-Leaderboard}, |
|
author={Evan Frick and Connor Chen and Joseph Tennyson and Tianle Li and Wei-Lin Chiang and Anastasios N. Angelopoulos and Ion Stoica}, |
|
year={2025}, |
|
eprint={2502.14855}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2502.14855}, |
|
} |
|
``` |