---
title: LLM As A Judge
emoji: 🏆
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---
# LLM As A Judge 🏆

LLM As A Judge is a Gradio-based application for comparing the performance of two LLaMA models saved in the GGUF format on a given prompt. It generates a response from each user-specified model, evaluates the responses against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.
## Features ✨
- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test the models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.
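
A minimal sketch of how these features might be wired into the Gradio interface (the `compare_models` stub and its wiring are illustrative assumptions, not the app's actual code):

```python
import gradio as gr

CRITERIA = ["Clarity", "Completeness", "Accuracy",
            "Relevance", "User-Friendliness", "Depth", "Creativity"]

def compare_models(repo_a, file_a, repo_b, file_b, prompt, criterion):
    # Hypothetical stub: the real app downloads both models, generates
    # responses, and asks the judge model to score them.
    return f"Would compare {repo_a}/{file_a} vs. {repo_b}/{file_b} on {criterion}."

demo = gr.Interface(
    fn=compare_models,
    inputs=[
        gr.Textbox(label="Model A Repository"),
        gr.Textbox(label="Model A Filename"),
        gr.Textbox(label="Model B Repository"),
        gr.Textbox(label="Model B Filename"),
        gr.Textbox(label="Prompt"),
        gr.Dropdown(choices=CRITERIA, label="Evaluation Criterion"),
    ],
    outputs=gr.Textbox(label="Evaluation Results"),
    title="LLM As A Judge",
)

if __name__ == "__main__":
    demo.launch()
```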
## Requirements ⚙️
- Only supports LLaMA models saved in GGUF format.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames.
## How It Works 🛠️
1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt used to generate responses.
3. **Select Evaluation Criterion**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate** (sketched below):
   - The app downloads and loads the specified models.
   - Responses are generated for the given prompt using both models.
   - The LoRA-4100 evaluation model evaluates the responses against the selected criterion.
5. **View Results**: Ratings, detailed explanations, and the declared winner or draw are displayed.
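
A rough sketch of the download-and-generate steps, assuming `huggingface_hub` and `llama-cpp-python` (the helper names, context size, and sampling settings are illustrative, not the app's actual code):

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_gguf_model(repo_id: str, filename: str) -> Llama:
    # Download the GGUF file from the Hugging Face Hub, then load it with llama.cpp.
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=path, n_ctx=2048)  # context size is an assumption

def generate_response(model: Llama, prompt: str) -> str:
    # Plain completion call; the app's actual sampling settings may differ.
    result = model(prompt, max_tokens=512)
    return result["choices"][0]["text"]
```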
## Behind the Scenes 🔍
- **Evaluation Model**: The app uses LoRA-4100, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to evaluate the responses objectively.
- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically based on user input.
- **Inference**: Both user-specified models generate responses for the prompt, which are then evaluated by the LoRA-4100 model (see the sketch below).
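
Continuing the sketch above, one plausible way to frame the judging step (the prompt template is an assumption; the app's actual template may differ):

```python
def judge_responses(judge: Llama, prompt: str, resp_a: str, resp_b: str,
                    criterion: str) -> str:
    # Ask the fine-tuned judge model to rate both responses on one criterion.
    # This template is illustrative only, not the app's actual prompt.
    eval_prompt = (
        f"Instruction: {prompt}\n\n"
        f"Response A: {resp_a}\n\n"
        f"Response B: {resp_b}\n\n"
        f"Rate each response from 1 to 10 for {criterion}, explain the "
        f"scores, and declare a winner or a draw."
    )
    result = judge(eval_prompt, max_tokens=512)
    return result["choices"][0]["text"]
```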
## Example 📝
**Input:**

- Model A Repository: `KolumbusLindh/LoRA-4100`
- Model A Filename: `unsloth.F16.gguf`
- Model B Repository: `forestav/gguf_lora_model`
- Model B Filename: `finetune_v2.gguf`
- Prompt: "Explain the significance of the Turing Test in artificial intelligence."
- Evaluation Criterion: Clarity

**Output:**

- Detailed evaluation results with scores for each model's response.
- Explanations for the scores based on the selected criterion.
- Declaration of the winning model or a draw.
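
Putting the sketches above together with these example inputs might look like the following (illustrative only):

```python
model_a = load_gguf_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
model_b = load_gguf_model("forestav/gguf_lora_model", "finetune_v2.gguf")
judge = load_gguf_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")  # LoRA-4100 as judge

prompt = "Explain the significance of the Turing Test in artificial intelligence."
response_a = generate_response(model_a, prompt)
response_b = generate_response(model_b, prompt)
print(judge_responses(judge, prompt, response_a, response_b, "Clarity"))
```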
## Limitations 🚧
- Only works with LLaMA models in GGUF format.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.
## Configuration Reference 📖
For detailed information on configuring a Hugging Face Space, see the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).