---
title: LLM As A Judge
emoji: 📚
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---

# LLM As A Judge 📚

**LLM As A Judge** is a Gradio-based application that lets users compare the performance of different LLaMA models saved in the GGUF format on a given prompt. It generates responses from two user-specified models, evaluates them against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.

## Features ✨

- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test the models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.

## Requirements ⚙️

- Only supports **LLaMA models** saved in **GGUF format**.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames.

## How It Works 🛠️

1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt used to generate responses.
3. **Select Evaluation Criterion**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate**:
   - The app downloads and loads the specified models.
   - Responses are generated for the given prompt using both models.
   - The **LoRA-4100 evaluation model** evaluates the responses against the selected criterion.
5. **View Results**: Ratings, detailed explanations, and the declared winner (or a draw) are displayed.

## Behind the Scenes 🔍

- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to objectively evaluate the responses.
- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face on demand, based on user input (see the implementation sketch below).
- **Inference**: Both user-specified models generate responses for the prompt, which are then evaluated by the LoRA-4100 model.

## Example 🌟

**Input:**

- **Model A Repository**: `KolumbusLindh/LoRA-4100`
- **Model A Filename**: `unsloth.F16.gguf`
- **Model B Repository**: `forestav/gguf_lora_model`
- **Model B Filename**: `finetune_v2.gguf`
- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
- **Evaluation Criterion**: Clarity

**Output:**

- Detailed evaluation results with scores for each model's response.
- Explanations for the scores based on the selected criterion.
- Declaration of the winning model or a draw.

## Limitations 🚧

- Only works with **LLaMA models in GGUF format**.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.

## Configuration Reference 📖

For detailed information on configuring a Hugging Face Space, see the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).
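
## Implementation Sketch 🧩

The Space's `app.py` is the source of truth; the snippets below are only a minimal sketch of the flow described above, assuming the `huggingface_hub` and `llama-cpp-python` packages. Function names such as `load_gguf_model` and `generate_response`, and parameters like `n_ctx` and `max_tokens`, are illustrative choices, not the app's actual code. First, the dynamic-loading step: download a GGUF file from the Hub and run inference with llama.cpp.

```python
# Sketch only: the actual app may use different loading and sampling settings.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama


def load_gguf_model(repo_id: str, filename: str) -> Llama:
    """Download a GGUF file from the Hugging Face Hub and load it."""
    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=model_path, n_ctx=2048)


def generate_response(llm: Llama, prompt: str) -> str:
    """Generate a completion for the user's prompt."""
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return result["choices"][0]["message"]["content"]
```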
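
The judging step works the same way: the LoRA-4100 model is loaded as a third GGUF model and prompted with both responses plus the selected criterion. The judging prompt below is a hypothetical example; the wording the Space actually uses is not shown here.

```python
# Sketch only: the exact judging prompt and output format are assumptions.
def evaluate_responses(judge: Llama, prompt: str, response_a: str,
                       response_b: str, criterion: str) -> str:
    """Ask the judge model to rate both responses on one criterion."""
    judge_prompt = (
        f"Rate the following two responses to the prompt below on "
        f"{criterion}, on a scale of 1-10. Explain each score, then "
        f"declare a winner or a draw.\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}"
    )
    result = judge.create_chat_completion(
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=512,
    )
    return result["choices"][0]["message"]["content"]


# Usage: the judge is loaded like any other GGUF model, e.g.
# judge = load_gguf_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
```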