---
title: LLM As A Judge
emoji: 📚
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---

# LLM As A Judge 📚

LLM As A Judge is a Gradio application for comparing the performance of LLaMA models saved in the GGUF format on a given prompt. It generates a response from each of two user-specified models, evaluates the responses against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.
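
As a rough illustration of the interface (the component names and wiring here are assumptions for this README, not taken from `app.py`), the app can be thought of as a Gradio `Interface` that collects the two model specifications, a prompt, and a criterion, and returns the evaluation text:

```python
# Rough sketch of the UI wiring (illustrative only; app.py may be organized differently).
import gradio as gr

def compare(repo_a, file_a, repo_b, file_b, prompt, criterion):
    # Placeholder for the real pipeline: download models, generate responses,
    # run the evaluation model, and return scores plus the declared winner.
    return f"Would compare {repo_a}/{file_a} vs. {repo_b}/{file_b} on {criterion}."

demo = gr.Interface(
    fn=compare,
    inputs=[
        gr.Textbox(label="Model A Repository"),
        gr.Textbox(label="Model A Filename"),
        gr.Textbox(label="Model B Repository"),
        gr.Textbox(label="Model B Filename"),
        gr.Textbox(label="Prompt"),
        gr.Dropdown(
            ["Clarity", "Completeness", "Accuracy", "Relevance",
             "User-Friendliness", "Depth", "Creativity"],
            label="Evaluation Criterion",
        ),
    ],
    outputs=gr.Textbox(label="Evaluation Results"),
)

if __name__ == "__main__":
    demo.launch()
```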

## Features ✨

- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test the models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.

## Requirements ⚙️

- Only LLaMA models saved in GGUF format are supported.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames (see the sketch below).
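
As a minimal sketch of what "accessible via repository name and filename" means in practice (the actual download logic in `app.py` may differ), a GGUF file can be resolved to a local path with `huggingface_hub`; the repository and filename below are the ones used in the example later in this README:

```python
# Minimal sketch: resolving a user-specified GGUF file from the Hugging Face Hub.
# Values are taken from the example section below; app.py may handle downloads differently.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="KolumbusLindh/LoRA-4100",  # Hugging Face repository name
    filename="unsloth.F16.gguf",        # GGUF filename inside the repository
)
print(model_path)  # local path to the cached .gguf file
```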

## How It Works 🛠️

1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt used to generate the responses.
3. **Select Evaluation Criteria**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate** (see the sketch after this list):
   - The app downloads and loads the specified models.
   - Both models generate a response to the given prompt.
   - The LoRA-4100 evaluation model scores the responses against the selected criterion.
5. **View Results**: Ratings, detailed explanations, and the declared winner (or a draw) are displayed.
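
The first two sub-steps of step 4 can be approximated as follows. This is a hedged sketch assuming `llama-cpp-python` is used for GGUF inference; the helper names, context size, and token limit are illustrative rather than the literal contents of `app.py`:

```python
# Sketch of loading the two user-specified GGUF models and generating responses.
# Assumes llama-cpp-python; helper names and parameters are illustrative.
from llama_cpp import Llama

def load_model(repo_id: str, filename: str) -> Llama:
    # Downloads the GGUF file from the Hub (if not already cached) and loads it.
    return Llama.from_pretrained(repo_id=repo_id, filename=filename, n_ctx=2048)

def generate(model: Llama, prompt: str) -> str:
    out = model.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return out["choices"][0]["message"]["content"]

# Repositories and filenames are taken from the example below.
model_a = load_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
model_b = load_model("forestav/gguf_lora_model", "finetune_v2.gguf")

prompt = "Explain the significance of the Turing Test in artificial intelligence."
response_a = generate(model_a, prompt)
response_b = generate(model_b, prompt)
```

The evaluation sub-step is sketched in the Behind the Scenes section below.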

## Behind the Scenes 🔍

- **Evaluation Model**: The app uses the LoRA-4100 model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to evaluate the responses objectively (a possible prompt format is sketched below).
- **Dynamic Model Loading**: Models are downloaded and loaded from Hugging Face at runtime based on user input.
- **Inference**: Both user-specified models generate a response to the prompt, and the LoRA-4100 evaluation model then scores the two responses.
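
The exact evaluation prompt is defined in `app.py`; the template below is a hypothetical illustration of how the selected criterion and the two responses could be presented to the LoRA-4100 judge:

```python
# Hypothetical judge-prompt construction; the real template lives in app.py.
def build_judge_prompt(prompt: str, response_a: str, response_b: str, criterion: str) -> str:
    return (
        f"You are an impartial judge. Evaluate the two responses to the prompt below "
        f"solely on {criterion.lower()}.\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Score each response from 1 to 10, explain both scores, "
        "and finish with 'Winner: A', 'Winner: B', or 'Draw'."
    )
```

The resulting string is sent to the evaluation model with the same `create_chat_completion` call used in the generation sketch above.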

## Example 🌟

**Input:**

- **Model A Repository:** `KolumbusLindh/LoRA-4100`
- **Model A Filename:** `unsloth.F16.gguf`
- **Model B Repository:** `forestav/gguf_lora_model`
- **Model B Filename:** `finetune_v2.gguf`
- **Prompt:** "Explain the significance of the Turing Test in artificial intelligence."
- **Evaluation Criterion:** Clarity

**Output:**

- Detailed evaluation results with scores for each model's response.
- Explanations of the scores based on the selected criterion.
- Declaration of the winning model or a draw.

## Limitations 🚧

- Only works with LLaMA models in GGUF format.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.

## Configuration Reference 📖

For detailed information on configuring a Hugging Face Space, see the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).