|
---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---
|
|
|
## **JinaJudge: Proxy Judgement for Russian LLM Arena** |
|
|
|
### **Description** |
|
This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. Although its focus is on Russian LLM evaluation, it can also be applied to English-centric models.
|
|
|
--- |
|
|
|
### **Model Details** |
|
- **Architecture**: Utilizes a `jina-embeddings-v3` encoder for feature extraction, followed by 4 transformer-decoder blocks. |
|
- **Data Source**: The training data was collected from the Russian LLM Arena. Contradictory judgements were filtered out, and transitive examples were added to improve generalization.
|
- **Judgement Classes**: Though the original arena includes five judgement categories (`A>>B`, `A>B`, `A=B`, `B>A`, `B>>A`), the model consolidates them into three simplified classes (see the mapping sketch after this list):
|
- **A > B** |
|
- **A = B** |
|
- **B > A** |
|
- **Training**: The model underwent full-weight fine-tuning with the Adam optimizer for 30 epochs, using a maximum sequence length of 4096 tokens; the best-performing weights were kept.
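
For clarity, the consolidation described above can be written as an explicit mapping. This is only an illustrative sketch: the class indices match the `judgement_map` in the usage example below, and the helper function is not part of the released code.

```python
# Collapse the five arena judgement labels into the three classes
# predicted by this model (indices match judgement_map in the usage example).
LABEL_TO_CLASS = {
    "A>>B": 0,  # A is better than B
    "A>B": 0,
    "A=B": 1,   # tie
    "B>A": 2,   # B is better than A
    "B>>A": 2,
}

def consolidate(label: str) -> int:
    """Map an arena label such as 'A>>B' to the model's class index."""
    return LABEL_TO_CLASS[label]
```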
|
|
|
--- |
|
|
|
### **Evaluation** |
|
Validation used **existing judgements** from the Russian LLM Arena, filtered and simplified to match the three-class structure used in training. A sketch of how such metrics can be computed follows the test results below.
|
|
|
**Models evaluated**: |
|
- **gemma-2-9b-it-sppo-iter3** |
|
- **glm-4-9b-chat** |
|
- **gpt-3.5-turbo-1106** |
|
- **mistral-7b-instruct-v0.3** |
|
- **storm-7b** |
|
|
|
**Validation Performance**: |
|
- **Accuracy**: 78.09% |
|
- **Precision**: 75.82% |
|
- **Recall**: 76.77% |
|
- **F1-score**: 76.27% |
|
|
|
For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model. |
|
|
|
**Test Performance**: |
|
- **Accuracy**: 80.07% |
|
- **Precision**: 76.68% |
|
- **Recall**: 77.73% |
|
- **F1-score**: 77.08% |
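
To make the reported numbers concrete, the following minimal sketch shows how three-class predictions can be scored against GPT-4 reference judgements with scikit-learn. The macro averaging and the placeholder label arrays are assumptions for illustration, not the actual evaluation script.

```python
# Hypothetical scoring sketch: compare proxy-judge predictions against
# GPT-4 reference labels (0: A > B, 1: A = B, 2: B > A).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

reference = [0, 2, 1, 0, 2]  # GPT-4 judgements (placeholder data)
predicted = [0, 2, 1, 1, 2]  # JinaJudge predictions (placeholder data)

accuracy = accuracy_score(reference, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(
    reference, predicted, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```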
|
|
|
--- |
|
|
|
### **Error Analysis** |
|
Upon reviewing erroneous predictions, the following observations were made: |
|
1. **Preference for English**: In some cases, the model prefers English responses even when the Russian responses are of higher quality.
|
2. **Difficulty with Paraphrasing**: The model occasionally struggles to distinguish between paraphrased responses.
|
3. **Ambiguous Prompts**: A significant portion of the errors arises from prompts in the Russian LLM Arena that don't allow for deterministic judgements, leading to noise in the evaluation data. |
|
|
|
While there is potential to improve alignment between this model and GPT-4, achieving an accuracy beyond 85% is unlikely due to the inherent noise in the benchmarks. |
|
|
|
--- |
|
|
|
### **Usage Example** |
|
|
|
```python
from transformers import AutoModel

jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model returns class scores for each input example;
# argmax selects the predicted judgement class.
judgement = jina([example])[0].argmax()

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}

print(judgement_map[judgement])
```
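
A possible batched variant, assuming the model accepts a list of formatted examples and returns one score vector per example (as the single-example call above suggests); this is a sketch, not a documented API guarantee:

```python
# Hypothetical batch usage: format several comparisons and judge them in one call.
examples = [
    prompt_template.format(user_prompt=p, assistant_a=a, assistant_b=b)
    for p, a, b in [
        ("prompt 1", "assistant A answer 1", "assistant B answer 1"),
        ("prompt 2", "assistant A answer 2", "assistant B answer 2"),
    ]
]

outputs = jina(examples)  # assumed to return one score vector per example
for scores in outputs:
    print(judgement_map[int(scores.argmax())])
```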