---
title: LLM As A Judge
emoji: π
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---
# LLM As A Judge

**LLM As A Judge** is a Gradio-based application for comparing the performance of two LLaMA models, saved in the GGUF format, on a given prompt. It generates a response from each user-specified model, scores both responses against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.
## Features

- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test the models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.
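These features map onto a small Gradio interface. The sketch below is a hypothetical reconstruction, not the actual `app.py`: the component labels, the `compare_models` placeholder, and the exact criteria list are assumptions based on this README.

```python
# Hypothetical sketch of the interface described above; the real app.py may differ.
import gradio as gr

CRITERIA = [
    "Clarity", "Completeness", "Accuracy", "Relevance",
    "User-Friendliness", "Depth", "Creativity",
]

def compare_models(repo_a, file_a, repo_b, file_b, prompt, criterion):
    # Placeholder body: the real app downloads both models, generates a
    # response from each, and asks the LoRA-4100 judge to score them.
    return f"Evaluation of the two responses on {criterion} appears here."

demo = gr.Interface(
    fn=compare_models,
    inputs=[
        gr.Textbox(label="Model A Repository"),
        gr.Textbox(label="Model A Filename"),
        gr.Textbox(label="Model B Repository"),
        gr.Textbox(label="Model B Filename"),
        gr.Textbox(label="Prompt", lines=3),
        gr.Dropdown(choices=CRITERIA, label="Evaluation Criterion"),
    ],
    outputs=gr.Markdown(label="Evaluation"),
    title="LLM As A Judge",
)

if __name__ == "__main__":
    demo.launch()
```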
## Requirements

- Supports only **LLaMA models** saved in **GGUF format**.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames.
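For illustration, the snippet below shows one way a GGUF model hosted on the Hub can be fetched by repository name and filename. It assumes `llama-cpp-python` as the GGUF runtime and introduces a helper name (`load_gguf_model`) that is not taken from this repository.

```python
# Minimal sketch, assuming llama-cpp-python as the GGUF runtime.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_gguf_model(repo_id: str, filename: str) -> Llama:
    """Download a GGUF file from the Hub by repo name + filename and load it."""
    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=model_path, n_ctx=2048)

# e.g. model_a = load_gguf_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
```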
## How It Works

1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt used to generate the responses.
3. **Select Evaluation Criterion**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate** (sketched below):
   - The app downloads and loads the specified models.
   - Both models generate a response to the given prompt.
   - The **LoRA-4100 evaluation model** rates the responses on the selected criterion.
5. **View Results**: Ratings, detailed explanations, and the declared winner (or a draw) are displayed.
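The generation step amounts to running the same prompt through both models. A minimal sketch, again assuming `llama-cpp-python`'s chat-completion API; the generation settings are assumptions, not values taken from the app:

```python
# Sketch of step 4: one response per model for the same prompt.
from llama_cpp import Llama

def generate_response(model: Llama, prompt: str, max_tokens: int = 512) -> str:
    """Generate a single chat-style response for the prompt."""
    out = model.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"]

# Run the same prompt through both user-specified models:
# response_a = generate_response(model_a, prompt)
# response_b = generate_response(model_b, prompt)
```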
## Behind the Scenes

- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to evaluate the responses objectively.
- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically, based on user input.
- **Inference**: Both user-specified models generate responses to the prompt, which are then evaluated by the LoRA-4100 model.
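The judging step itself is not spelled out in this README, so the following is only a hedged sketch of how the LoRA-4100 evaluator might be prompted with the two responses and the chosen criterion; the real prompt template, rating scale, and result parsing may differ.

```python
# Illustrative judging sketch; the prompt wording and 1-10 scale are assumptions.
from llama_cpp import Llama

def judge_responses(evaluator: Llama, prompt: str, response_a: str,
                    response_b: str, criterion: str) -> str:
    """Ask the evaluator model to rate both responses on one criterion."""
    judge_prompt = (
        f"You are an impartial judge. Evaluate the two responses to the "
        f"instruction below, focusing on {criterion.lower()}.\n\n"
        f"Instruction: {prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Rate each response from 1 to 10, explain both ratings, and declare "
        "a winner or a draw."
    )
    out = evaluator.create_chat_completion(
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1024,
    )
    return out["choices"][0]["message"]["content"]
```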
## Example

**Input:**

- **Model A Repository**: `KolumbusLindh/LoRA-4100`
- **Model A Filename**: `unsloth.F16.gguf`
- **Model B Repository**: `forestav/gguf_lora_model`
- **Model B Filename**: `finetune_v2.gguf`
- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
- **Evaluation Criterion**: Clarity

**Output:**

- Detailed evaluation results with a score for each model's response.
- Explanations of the scores based on the selected criterion.
- Declaration of the winning model, or a draw.
## Limitations

- Only works with **LLaMA models in GGUF format**.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.
## Configuration Reference

For detailed information on configuring a Hugging Face Space, see the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).