Kolumbus Lindh committed
Commit · 7841304
1 Parent(s): 8f23865
updates
README.md
CHANGED
@@ -10,4 +10,53 @@ pinned: false
short_description: Compare the performance of different models.
---

# LLM As A Judge

**LLM As A Judge** is a Gradio-based application that lets users compare the performance of different LLaMA models saved in the GGUF format on a given prompt. It generates responses from two user-specified models, evaluates the responses against user-selected criteria, and declares a winner using a fine-tuned evaluation model.

## Features

- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test the models with any prompt of your choice.
- **Evaluation Criteria**: Select from seven predefined criteria (up to three at a time): clarity, completeness, accuracy, relevance, user-friendliness, depth, and creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.

## Requirements

- Only supports **LLaMA models** saved in **GGUF format**.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames.

## How It Works

1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt used to generate responses.
3. **Select Evaluation Criteria**: Choose up to three evaluation criteria (e.g., clarity or relevance).
4. **Generate Responses and Evaluate**:
   - The app downloads and loads the specified models (a minimal sketch follows this list).
   - Responses are generated for the given prompt using both models.
   - The **LoRA-4100 evaluation model** evaluates the responses based on the selected criteria.
5. **View Results**: Ratings, detailed explanations, and the declared winner or draw are displayed.
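For readers who want to see what the download-and-generate path in steps 1-4 amounts to, here is a minimal sketch using the same `huggingface_hub` and `llama-cpp-python` calls as `app.py` below. The helper name `load_gguf` and the `n_ctx` value are illustrative assumptions, not part of the app itself.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_gguf(repo_id: str, filename: str) -> Llama:
    # Download the GGUF file from the Hugging Face Hub and load it with llama.cpp.
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=local_path, n_ctx=2048)  # n_ctx is an illustrative choice

model = load_gguf("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
completion = model("Explain the significance of the Turing Test.", max_tokens=256, temperature=0.7)
print(completion["choices"][0]["text"])
```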
## Behind the Scenes

- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to objectively evaluate the responses.
- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically, based on user input.
- **Inference**: Both user-specified models generate responses for the prompt, which are then evaluated by the LoRA-4100 model.

## Example

**Input:**

- **Model A Repository**: `KolumbusLindh/LoRA-4100`
- **Model A Filename**: `unsloth.F16.gguf`
- **Model B Repository**: `forestav/gguf_lora_model`
- **Model B Filename**: `finetune_v2.gguf`
- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
- **Evaluation Criterion**: Clarity

**Output:**

- Detailed evaluation results with scores for each model's response.
- Explanations for the scores based on the selected criterion.
- Declaration of the winning model or a draw (see the sketch below for a programmatic version of this example).
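As an illustration, the example above corresponds roughly to the following direct call to `evaluate_responses` from `app.py` (bypassing the Gradio UI). Treat it as a sketch: importing `app` loads the LoRA-4100 judge model at import time, and the call itself downloads both GGUF files.

```python
# Hypothetical direct call mirroring the example above (bypasses the Gradio UI).
from app import evaluate_responses  # assumes app.py is importable from the working directory

result = evaluate_responses(
    prompt="Explain the significance of the Turing Test in artificial intelligence.",
    repo_a="KolumbusLindh/LoRA-4100",
    model_a="unsloth.F16.gguf",
    repo_b="forestav/gguf_lora_model",
    model_b="finetune_v2.gguf",
    evaluation_criteria=["Clarity"],  # up to three criteria may be selected
)
print(result)  # both responses plus the judge's ratings and verdict
```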
## Limitations

- Only works with **LLaMA models in GGUF format**.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.

## Configuration Reference

For detailed information on configuring a Hugging Face Space, visit the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).

app.py
CHANGED
@@ -2,7 +2,7 @@ import gradio as gr
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Load a user-specified model
def load_user_model(repo_id, model_file):
    print(f"Downloading model {model_file} from repository {repo_id}...")
    local_path = hf_hub_download(repo_id=repo_id, filename=model_file)

@@ -14,9 +14,9 @@ def generate_response(model, prompt):
    response = model(prompt, max_tokens=256, temperature=0.7)
    return response["choices"][0]["text"]

# Evaluate responses using the LoRA evaluation model
def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_criteria):
    # Load models
    model_a_instance = load_user_model(repo_a, model_a)
    model_b_instance = load_user_model(repo_b, model_b)

@@ -24,19 +24,21 @@ def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_crit
    response_a = generate_response(model_a_instance, prompt)
    response_b = generate_response(model_b_instance, prompt)

    # Display generated responses
    print(f"Response A: {response_a}")
    print(f"Response B: {response_b}")

    # Format the evaluation prompt
    criteria_list = ", ".join(evaluation_criteria)
    evaluation_prompt = f"""
Prompt: {prompt}

Response A: {response_a}
Response B: {response_b}

Evaluation Criteria: {criteria_list}

Please evaluate the responses based on the selected criteria. For each criterion, rate both responses on a scale from 1 to 10 and provide a justification. Finally, declare the winner (or 'draw' if they are equal).
"""
    # Use the LoRA model to evaluate the responses
    evaluation_response = lora_model.create_completion(

@@ -44,9 +46,17 @@ Please evaluate the responses based on the criteria above. Rate each response on
        max_tokens=512,
        temperature=0.5
    )
    evaluation_results = evaluation_response["choices"][0]["text"]

    # Combine results for display
    final_output = f"""
Response A:\n{response_a}\n\n
Response B:\n{response_b}\n\n
Evaluation Results:\n{evaluation_results}
"""
    return final_output

# Load the LoRA evaluation model
def load_lora_model():
    repo_id = "KolumbusLindh/LoRA-4100"
    model_file = "unsloth.F16.gguf"

@@ -62,41 +72,37 @@ print("LoRA evaluation model loaded successfully!")
with gr.Blocks(title="LLM as a Judge") as demo:
    gr.Markdown("## LLM as a Judge")

    # Model inputs
    repo_a_input = gr.Textbox(label="Model A Repository", placeholder="Enter the Hugging Face repo name for Model A...")
    model_a_input = gr.Textbox(label="Model A File Name", placeholder="Enter the model filename for Model A...")
    repo_b_input = gr.Textbox(label="Model B Repository", placeholder="Enter the Hugging Face repo name for Model B...")
    model_b_input = gr.Textbox(label="Model B File Name", placeholder="Enter the model filename for Model B...")

    # Prompt and criteria inputs
    prompt_input = gr.Textbox(label="Enter Prompt", placeholder="Enter the prompt here...", lines=3)
    criteria_dropdown = gr.CheckboxGroup(
        label="Select Up to 3 Evaluation Criteria",
        choices=["Clarity", "Completeness", "Accuracy", "Relevance", "User-Friendliness", "Depth", "Creativity"],
        value=["Clarity"],
        max_choices=3
    )

    # Button and outputs
    evaluate_button = gr.Button("Evaluate Models")
    evaluation_output = gr.Textbox(
        label="Evaluation Results",
        placeholder="The evaluation results will appear here...",
        lines=20,
        interactive=False
    )

    # Link evaluation function
    evaluate_button.click(
        fn=evaluate_responses,
        inputs=[prompt_input, repo_a_input, model_a_input, repo_b_input, model_b_input, criteria_dropdown],
        outputs=[evaluation_output]
    )

# Launch app
if __name__ == "__main__":
    demo.launch()
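Once the Space is running, the same evaluation can in principle be driven programmatically with `gradio_client`. The sketch below is an assumption rather than documented usage for this Space: the Space id is a placeholder and the `api_name` is guessed from the wired function name (`evaluate_responses`), so check the Space's "Use via API" panel for the actual endpoint.

```python
# Hypothetical client-side call to the running Space (Space id and api_name are assumptions).
from gradio_client import Client

client = Client("KolumbusLindh/llm-as-a-judge")  # placeholder Space id
result = client.predict(
    "Explain the significance of the Turing Test in artificial intelligence.",  # prompt
    "KolumbusLindh/LoRA-4100",       # Model A repository
    "unsloth.F16.gguf",              # Model A file name
    "forestav/gguf_lora_model",      # Model B repository
    "finetune_v2.gguf",              # Model B file name
    ["Clarity"],                     # evaluation criteria (up to three)
    api_name="/evaluate_responses",  # assumed endpoint name
)
print(result)
```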