---
title: LLM As A Judge
emoji: 📚
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---

# LLM As A Judge 📚

LLM As A Judge is a Gradio application for comparing the performance of LLaMA models saved in the GGUF format on a given prompt. It generates a response from each of two user-specified models, evaluates the responses against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.
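
As a rough illustration of the interface (the component names and wiring here are assumptions for this README, not taken from `app.py`), the app can be thought of as a Gradio `Interface` that collects the two model specifications, a prompt, and a criterion, and returns the evaluation text:

```python
# Rough sketch of the UI wiring (illustrative only; app.py may be organized differently).
import gradio as gr

def compare(repo_a, file_a, repo_b, file_b, prompt, criterion):
    # Placeholder for the real pipeline: download models, generate responses,
    # run the evaluation model, and return scores plus the declared winner.
    return f"Would compare {repo_a}/{file_a} vs. {repo_b}/{file_b} on {criterion}."

demo = gr.Interface(
    fn=compare,
    inputs=[
        gr.Textbox(label="Model A Repository"),
        gr.Textbox(label="Model A Filename"),
        gr.Textbox(label="Model B Repository"),
        gr.Textbox(label="Model B Filename"),
        gr.Textbox(label="Prompt"),
        gr.Dropdown(
            ["Clarity", "Completeness", "Accuracy", "Relevance",
             "User-Friendliness", "Depth", "Creativity"],
            label="Evaluation Criterion",
        ),
    ],
    outputs=gr.Textbox(label="Evaluation Results"),
)

if __name__ == "__main__":
    demo.launch()
```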

## Features ✨

- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test the models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.

## Requirements ⚙️

- Only LLaMA models saved in GGUF format are supported.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames (see the sketch below).
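
As a minimal sketch of what "accessible via repository name and filename" means in practice (the actual download logic in `app.py` may differ), a GGUF file can be resolved to a local path with `huggingface_hub`; the repository and filename below are the ones used in the example later in this README:

```python
# Minimal sketch: resolving a user-specified GGUF file from the Hugging Face Hub.
# Values are taken from the example section below; app.py may handle downloads differently.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="KolumbusLindh/LoRA-4100",  # Hugging Face repository name
    filename="unsloth.F16.gguf",        # GGUF filename inside the repository
)
print(model_path)  # local path to the cached .gguf file
```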

## How It Works 🛠️

1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt used to generate the responses.
3. **Select Evaluation Criteria**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate** (see the sketch after this list):
   - The app downloads and loads the specified models.
   - Both models generate a response to the given prompt.
   - The LoRA-4100 evaluation model scores the responses against the selected criterion.
5. **View Results**: Ratings, detailed explanations, and the declared winner (or a draw) are displayed.
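
The first two sub-steps of step 4 can be approximated as follows. This is a hedged sketch assuming `llama-cpp-python` is used for GGUF inference; the helper names, context size, and token limit are illustrative rather than the literal contents of `app.py`:

```python
# Sketch of loading the two user-specified GGUF models and generating responses.
# Assumes llama-cpp-python; helper names and parameters are illustrative.
from llama_cpp import Llama

def load_model(repo_id: str, filename: str) -> Llama:
    # Downloads the GGUF file from the Hub (if not already cached) and loads it.
    return Llama.from_pretrained(repo_id=repo_id, filename=filename, n_ctx=2048)

def generate(model: Llama, prompt: str) -> str:
    out = model.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return out["choices"][0]["message"]["content"]

# Repositories and filenames are taken from the example below.
model_a = load_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
model_b = load_model("forestav/gguf_lora_model", "finetune_v2.gguf")

prompt = "Explain the significance of the Turing Test in artificial intelligence."
response_a = generate(model_a, prompt)
response_b = generate(model_b, prompt)
```

The evaluation sub-step is sketched in the Behind the Scenes section below.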

## Behind the Scenes 🔍

- **Evaluation Model**: The app uses the LoRA-4100 model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to evaluate the responses objectively (a possible prompt format is sketched below).
- **Dynamic Model Loading**: Models are downloaded and loaded from Hugging Face at runtime based on user input.
- **Inference**: Both user-specified models generate a response to the prompt, and the LoRA-4100 evaluation model then scores the two responses.
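
The exact evaluation prompt is defined in `app.py`; the template below is a hypothetical illustration of how the selected criterion and the two responses could be presented to the LoRA-4100 judge:

```python
# Hypothetical judge-prompt construction; the real template lives in app.py.
def build_judge_prompt(prompt: str, response_a: str, response_b: str, criterion: str) -> str:
    return (
        f"You are an impartial judge. Evaluate the two responses to the prompt below "
        f"solely on {criterion.lower()}.\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Score each response from 1 to 10, explain both scores, "
        "and finish with 'Winner: A', 'Winner: B', or 'Draw'."
    )
```

The resulting string is sent to the evaluation model with the same `create_chat_completion` call used in the generation sketch above.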

## Example 🌟

**Input:**

- **Model A Repository:** `KolumbusLindh/LoRA-4100`
- **Model A Filename:** `unsloth.F16.gguf`
- **Model B Repository:** `forestav/gguf_lora_model`
- **Model B Filename:** `finetune_v2.gguf`
- **Prompt:** "Explain the significance of the Turing Test in artificial intelligence."
- **Evaluation Criterion:** Clarity

**Output:**

- Detailed evaluation results with scores for each model's response.
- Explanations of the scores based on the selected criterion.
- Declaration of the winning model or a draw.

## Limitations 🚧

- Only works with LLaMA models in GGUF format.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.

## Configuration Reference 📖

For detailed information on configuring a Hugging Face Space, see the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).