Kolumbus Lindh committed
Commit 7841304 · 1 Parent(s): 8f23865
Files changed (2):
  1. README.md +50 -1
  2. app.py +33 -27
README.md CHANGED
@@ -10,4 +10,53 @@ pinned: false
 short_description: Compare the performance of different models.
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# LLM As A Judge 📚
+
+**LLM As A Judge** is a Gradio-based application that allows users to compare the performance of different LLaMA models saved in the GGUF format on a given prompt. It generates responses from two user-specified models, evaluates their performance based on user-selected criteria, and declares a winner using a fine-tuned evaluation model.
+
+## Features ✨
+- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository and model filenames.
+- **Custom Prompts**: Test models with any prompt of your choice.
+- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
+- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.
+
+## Requirements ⚙️
+- Only supports **LLaMA models** saved in **GGUF format**.
+- Models must be hosted on Hugging Face and accessible via their repository names and filenames.
+
+## How It Works 🛠️
+1. **Input Model Details**: Provide the repository names and filenames for both models.
+2. **Input Prompt**: Enter the prompt to generate responses.
+3. **Select Evaluation Criteria**: Choose an evaluation criterion (e.g., clarity or relevance).
+4. **Generate Responses and Evaluate**:
+   - The app downloads and loads the specified models.
+   - Responses are generated for the given prompt using both models.
+   - The **LoRA-4100 evaluation model** evaluates the responses based on the selected criteria.
+5. **View Results**: Ratings, detailed explanations, and the declared winner or draw are displayed.
+
+## Behind the Scenes 🔍
+- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to objectively evaluate the responses.
+- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically based on user input.
+- **Inference**: Both user-specified models generate responses for the prompt, which are then evaluated by the LoRA-4100 model.
+
+## Example 🌟
+**Input:**
+- **Model A Repository**: `KolumbusLindh/LoRA-4100`
+- **Model A Filename**: `unsloth.F16.gguf`
+- **Model B Repository**: `forestav/gguf_lora_model`
+- **Model B Filename**: `finetune_v2.gguf`
+- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
+- **Evaluation Criterion**: Clarity
+
+**Output:**
+- Detailed evaluation results with scores for each model's response.
+- Explanations for the scores based on the selected criterion.
+- Declaration of the winning model or a draw.
+
+## Limitations 🚧
+- Only works with **LLaMA models in GGUF format**.
+- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.
+
+## Configuration Reference 📖
+For detailed information on configuring a Hugging Face Space, visit the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).
+
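For reference, the end-to-end flow described in the README (download both GGUF models, generate a response from each, then ask the LoRA-4100 judge to score them) can be reproduced outside Gradio. The sketch below is a minimal, hedged example: it assumes `llama-cpp-python` and `huggingface_hub` are installed and uses the repositories and prompt from the Example section; the helper `load_gguf` and the `n_ctx` value are illustrative choices, not part of `app.py`.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_gguf(repo_id, filename):
    # Fetch the GGUF file from the Hub and wrap it in a llama.cpp model.
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=local_path, n_ctx=2048)  # n_ctx is an assumed value

# The two contenders from the Example section.
model_a = load_gguf("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
model_b = load_gguf("forestav/gguf_lora_model", "finetune_v2.gguf")

prompt = "Explain the significance of the Turing Test in artificial intelligence."
response_a = model_a(prompt, max_tokens=256, temperature=0.7)["choices"][0]["text"]
response_b = model_b(prompt, max_tokens=256, temperature=0.7)["choices"][0]["text"]

# The judge is the same LoRA-4100 GGUF file that serves as Model A above.
judge = load_gguf("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
evaluation_prompt = (
    f"Prompt: {prompt}\n\n"
    f"Response A: {response_a}\n"
    f"Response B: {response_b}\n\n"
    "Evaluation Criteria: Clarity\n\n"
    "Rate each response on a scale from 1 to 10 for the criterion above, "
    "explain the ratings, and declare a winner or a draw."
)
verdict = judge.create_completion(evaluation_prompt, max_tokens=512, temperature=0.5)
print(verdict["choices"][0]["text"])
```

This mirrors `load_user_model`, `generate_response`, and the judge prompt in `app.py`; only the criterion string and generation settings are hard-coded here.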
app.py CHANGED
@@ -2,7 +2,7 @@ import gradio as gr
 from llama_cpp import Llama
 from huggingface_hub import hf_hub_download
 
-# Function to load a user-specified model from Hugging Face
+# Load a user-specified model
 def load_user_model(repo_id, model_file):
     print(f"Downloading model {model_file} from repository {repo_id}...")
     local_path = hf_hub_download(repo_id=repo_id, filename=model_file)
@@ -14,9 +14,9 @@ def generate_response(model, prompt):
     response = model(prompt, max_tokens=256, temperature=0.7)
     return response["choices"][0]["text"]
 
-# Evaluate responses generated by two models using the LoRA evaluation model
+# Evaluate responses using the LoRA evaluation model
 def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_criteria):
-    # Load user-specified models
+    # Load models
     model_a_instance = load_user_model(repo_a, model_a)
     model_b_instance = load_user_model(repo_b, model_b)
 
@@ -24,19 +24,21 @@ def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_criteria):
     response_a = generate_response(model_a_instance, prompt)
     response_b = generate_response(model_b_instance, prompt)
 
+    # Display generated responses
     print(f"Response A: {response_a}")
     print(f"Response B: {response_b}")
 
-    # Format the evaluation prompt for the LoRA model
+    # Format the evaluation prompt
+    criteria_list = ", ".join(evaluation_criteria)
     evaluation_prompt = f"""
 Prompt: {prompt}
 
 Response A: {response_a}
 Response B: {response_b}
 
-Evaluation Criteria: {evaluation_criteria}
+Evaluation Criteria: {criteria_list}
 
-Please evaluate the responses based on the criteria above. Rate each response on a scale from 1 to 10 for each criterion and provide a detailed explanation. Finally, declare a winner or state 'draw' if they are equal.
+Please evaluate the responses based on the selected criteria. For each criterion, rate both responses on a scale from 1 to 10 and provide a justification. Finally, declare the winner (or 'draw' if they are equal).
     """
     # Use the LoRA model to evaluate the responses
     evaluation_response = lora_model.create_completion(
@@ -44,9 +46,17 @@ Please evaluate the responses based on the criteria above. Rate each response on
         max_tokens=512,
         temperature=0.5
     )
-    return evaluation_response["choices"][0]["text"]
+    evaluation_results = evaluation_response["choices"][0]["text"]
+
+    # Combine results for display
+    final_output = f"""
+Response A:\n{response_a}\n\n
+Response B:\n{response_b}\n\n
+Evaluation Results:\n{evaluation_results}
+    """
+    return final_output
 
-# Load the base LoRA evaluation model
+# Load the LoRA evaluation model
 def load_lora_model():
     repo_id = "KolumbusLindh/LoRA-4100"
     model_file = "unsloth.F16.gguf"
@@ -62,41 +72,37 @@ print("LoRA evaluation model loaded successfully!")
 with gr.Blocks(title="LLM as a Judge") as demo:
     gr.Markdown("## LLM as a Judge 🧐")
 
-    # Inputs for Model A repository and file
-    repo_a_input = gr.Textbox(label="Model A Repository (e.g., KolumbusLindh/LoRA-4100)", placeholder="Enter the Hugging Face repo name for Model A...")
-    model_a_input = gr.Textbox(label="Model A File Name (e.g., unsloth.F16.gguf)", placeholder="Enter the model filename for Model A...")
+    # Model inputs
+    repo_a_input = gr.Textbox(label="Model A Repository", placeholder="Enter the Hugging Face repo name for Model A...")
+    model_a_input = gr.Textbox(label="Model A File Name", placeholder="Enter the model filename for Model A...")
+    repo_b_input = gr.Textbox(label="Model B Repository", placeholder="Enter the Hugging Face repo name for Model B...")
+    model_b_input = gr.Textbox(label="Model B File Name", placeholder="Enter the model filename for Model B...")
 
-    # Inputs for Model B repository and file
-    repo_b_input = gr.Textbox(label="Model B Repository (e.g., KolumbusLindh/LoRA-4100)", placeholder="Enter the Hugging Face repo name for Model B...")
-    model_b_input = gr.Textbox(label="Model B File Name (e.g., unsloth.F16.gguf)", placeholder="Enter the model filename for Model B...")
-
-    # Input for prompt and evaluation criteria
+    # Prompt and criteria inputs
     prompt_input = gr.Textbox(label="Enter Prompt", placeholder="Enter the prompt here...", lines=3)
-    criteria_dropdown = gr.Dropdown(
-        label="Select Evaluation Criteria",
+    criteria_dropdown = gr.CheckboxGroup(
+        label="Select Up to 3 Evaluation Criteria",
         choices=["Clarity", "Completeness", "Accuracy", "Relevance", "User-Friendliness", "Depth", "Creativity"],
-        value="Clarity",
-        type="value"
+        value=["Clarity"],
+        max_choices=3
     )
 
-    # Button to evaluate responses
+    # Button and outputs
    evaluate_button = gr.Button("Evaluate Models")
-
-    # Output for evaluation results
     evaluation_output = gr.Textbox(
         label="Evaluation Results",
         placeholder="The evaluation results will appear here...",
-        lines=10,
+        lines=20,
         interactive=False
     )
 
-    # Link the evaluation function to the button
+    # Link evaluation function
     evaluate_button.click(
         fn=evaluate_responses,
         inputs=[prompt_input, repo_a_input, model_a_input, repo_b_input, model_b_input, criteria_dropdown],
         outputs=[evaluation_output]
     )
 
-# Launch the Gradio app
+# Launch app
 if __name__ == "__main__":
-    demo.launch()  # Add share=True to create a public link
+    demo.launch()
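The functional change in this commit is the move from a single-criterion `gr.Dropdown` to a multi-select `gr.CheckboxGroup`, so `evaluate_responses` now receives a list of labels rather than one string and joins them into the judge prompt. Below is a minimal standalone sketch of that data flow; the selected criteria and the placeholder responses are illustrative, not output from the app.

```python
# gr.CheckboxGroup passes the selected labels to the callback as a list;
# the previous gr.Dropdown passed a single string such as "Clarity".
evaluation_criteria = ["Clarity", "Accuracy", "Depth"]  # example selection
criteria_list = ", ".join(evaluation_criteria)          # "Clarity, Accuracy, Depth"

# Placeholder responses standing in for the two models' outputs.
response_a = "The Turing Test asks whether a machine's replies are indistinguishable from a human's."
response_b = "It is a benchmark for machine intelligence proposed by Alan Turing in 1950."

evaluation_prompt = f"""
Prompt: Explain the significance of the Turing Test in artificial intelligence.

Response A: {response_a}
Response B: {response_b}

Evaluation Criteria: {criteria_list}

Please evaluate the responses based on the selected criteria. For each criterion, rate both responses on a scale from 1 to 10 and provide a justification. Finally, declare the winner (or 'draw' if they are equal).
"""

# This string is what lora_model.create_completion() receives in evaluate_responses.
print(evaluation_prompt)
```

Joining the labels keeps the judge prompt format identical whether one criterion or several are selected.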