Kolumbus Lindh committed
Commit 7841304 · 1 Parent(s): 8f23865
Files changed (2):
  1. README.md +50 -1
  2. app.py +33 -27
README.md CHANGED
@@ -10,4 +10,53 @@ pinned: false
 short_description: Compare the performance of different models.
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# LLM As A Judge 📚
+
+**LLM As A Judge** is a Gradio-based application that allows users to compare the performance of different LLaMA models saved in the GGUF format on a given prompt. It generates responses from two user-specified models, evaluates their performance based on user-selected criteria, and declares a winner using a fine-tuned evaluation model.
+
+## Features ✨
+- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository and model filenames.
+- **Custom Prompts**: Test models with any prompt of your choice.
+- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
+- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.
+
+## Requirements ⚙️
+- Only supports **LLaMA models** saved in **GGUF format**.
+- Models must be hosted on Hugging Face and accessible via their repository names and filenames.
+
+## How It Works 🛠️
+1. **Input Model Details**: Provide the repository names and filenames for both models.
+2. **Input Prompt**: Enter the prompt to generate responses.
+3. **Select Evaluation Criteria**: Choose an evaluation criterion (e.g., clarity or relevance).
+4. **Generate Responses and Evaluate**:
+   - The app downloads and loads the specified models.
+   - Responses are generated for the given prompt using both models.
+   - The **LoRA-4100 evaluation model** evaluates the responses based on the selected criteria.
+5. **View Results**: Ratings, detailed explanations, and the declared winner or draw are displayed.
+
+## Behind the Scenes 🔍
+- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to objectively evaluate the responses.
+- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically based on user input.
+- **Inference**: Both user-specified models generate responses for the prompt, which are then evaluated by the LoRA-4100 model.
+
+## Example 🌟
+**Input:**
+- **Model A Repository**: `KolumbusLindh/LoRA-4100`
+- **Model A Filename**: `unsloth.F16.gguf`
+- **Model B Repository**: `forestav/gguf_lora_model`
+- **Model B Filename**: `finetune_v2.gguf`
+- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
+- **Evaluation Criterion**: Clarity
+
+**Output:**
+- Detailed evaluation results with scores for each model's response.
+- Explanations for the scores based on the selected criterion.
+- Declaration of the winning model or a draw.
+
+## Limitations 🚧
+- Only works with **LLaMA models in GGUF format**.
+- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.
+
+## Configuration Reference 📖
+For detailed information on configuring a Hugging Face Space, visit the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).
+
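For reference, the end-to-end flow described in the README (download both GGUF models, generate a response from each, then ask the LoRA-4100 judge to score them) can be reproduced outside Gradio. The sketch below is a minimal, hedged example: it assumes `llama-cpp-python` and `huggingface_hub` are installed and uses the repositories and prompt from the Example section; the helper `load_gguf` and the `n_ctx` value are illustrative choices, not part of `app.py`.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_gguf(repo_id, filename):
    # Fetch the GGUF file from the Hub and wrap it in a llama.cpp model.
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=local_path, n_ctx=2048)  # n_ctx is an assumed value

# The two contenders from the Example section.
model_a = load_gguf("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
model_b = load_gguf("forestav/gguf_lora_model", "finetune_v2.gguf")

prompt = "Explain the significance of the Turing Test in artificial intelligence."
response_a = model_a(prompt, max_tokens=256, temperature=0.7)["choices"][0]["text"]
response_b = model_b(prompt, max_tokens=256, temperature=0.7)["choices"][0]["text"]

# The judge is the same LoRA-4100 GGUF file that serves as Model A above.
judge = load_gguf("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
evaluation_prompt = (
    f"Prompt: {prompt}\n\n"
    f"Response A: {response_a}\n"
    f"Response B: {response_b}\n\n"
    "Evaluation Criteria: Clarity\n\n"
    "Rate each response on a scale from 1 to 10 for the criterion above, "
    "explain the ratings, and declare a winner or a draw."
)
verdict = judge.create_completion(evaluation_prompt, max_tokens=512, temperature=0.5)
print(verdict["choices"][0]["text"])
```

This mirrors `load_user_model`, `generate_response`, and the judge prompt in `app.py`; only the criterion string and generation settings are hard-coded here.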
app.py CHANGED
@@ -2,7 +2,7 @@ import gradio as gr
 from llama_cpp import Llama
 from huggingface_hub import hf_hub_download
 
-# Function to load a user-specified model from Hugging Face
+# Load a user-specified model
 def load_user_model(repo_id, model_file):
     print(f"Downloading model {model_file} from repository {repo_id}...")
     local_path = hf_hub_download(repo_id=repo_id, filename=model_file)
@@ -14,9 +14,9 @@ def generate_response(model, prompt):
     response = model(prompt, max_tokens=256, temperature=0.7)
     return response["choices"][0]["text"]
 
-# Evaluate responses generated by two models using the LoRA evaluation model
+# Evaluate responses using the LoRA evaluation model
 def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_criteria):
-    # Load user-specified models
+    # Load models
     model_a_instance = load_user_model(repo_a, model_a)
     model_b_instance = load_user_model(repo_b, model_b)
 
@@ -24,19 +24,21 @@ def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_criteria):
     response_a = generate_response(model_a_instance, prompt)
     response_b = generate_response(model_b_instance, prompt)
 
+    # Display generated responses
     print(f"Response A: {response_a}")
     print(f"Response B: {response_b}")
 
-    # Format the evaluation prompt for the LoRA model
+    # Format the evaluation prompt
+    criteria_list = ", ".join(evaluation_criteria)
     evaluation_prompt = f"""
 Prompt: {prompt}
 
 Response A: {response_a}
 Response B: {response_b}
 
-Evaluation Criteria: {evaluation_criteria}
+Evaluation Criteria: {criteria_list}
 
-Please evaluate the responses based on the criteria above. Rate each response on a scale from 1 to 10 for each criterion and provide a detailed explanation. Finally, declare a winner or state 'draw' if they are equal.
+Please evaluate the responses based on the selected criteria. For each criterion, rate both responses on a scale from 1 to 10 and provide a justification. Finally, declare the winner (or 'draw' if they are equal).
     """
     # Use the LoRA model to evaluate the responses
     evaluation_response = lora_model.create_completion(
@@ -44,9 +46,17 @@ Please evaluate the responses based on the criteria above. Rate each response on
         max_tokens=512,
         temperature=0.5
     )
-    return evaluation_response["choices"][0]["text"]
+    evaluation_results = evaluation_response["choices"][0]["text"]
+
+    # Combine results for display
+    final_output = f"""
+Response A:\n{response_a}\n\n
+Response B:\n{response_b}\n\n
+Evaluation Results:\n{evaluation_results}
+    """
+    return final_output
 
-# Load the base LoRA evaluation model
+# Load the LoRA evaluation model
 def load_lora_model():
     repo_id = "KolumbusLindh/LoRA-4100"
     model_file = "unsloth.F16.gguf"
@@ -62,41 +72,37 @@ print("LoRA evaluation model loaded successfully!")
 with gr.Blocks(title="LLM as a Judge") as demo:
     gr.Markdown("## LLM as a Judge 🧐")
 
-    # Inputs for Model A repository and file
-    repo_a_input = gr.Textbox(label="Model A Repository (e.g., KolumbusLindh/LoRA-4100)", placeholder="Enter the Hugging Face repo name for Model A...")
-    model_a_input = gr.Textbox(label="Model A File Name (e.g., unsloth.F16.gguf)", placeholder="Enter the model filename for Model A...")
+    # Model inputs
+    repo_a_input = gr.Textbox(label="Model A Repository", placeholder="Enter the Hugging Face repo name for Model A...")
+    model_a_input = gr.Textbox(label="Model A File Name", placeholder="Enter the model filename for Model A...")
+    repo_b_input = gr.Textbox(label="Model B Repository", placeholder="Enter the Hugging Face repo name for Model B...")
+    model_b_input = gr.Textbox(label="Model B File Name", placeholder="Enter the model filename for Model B...")
 
-    # Inputs for Model B repository and file
-    repo_b_input = gr.Textbox(label="Model B Repository (e.g., KolumbusLindh/LoRA-4100)", placeholder="Enter the Hugging Face repo name for Model B...")
-    model_b_input = gr.Textbox(label="Model B File Name (e.g., unsloth.F16.gguf)", placeholder="Enter the model filename for Model B...")
-
-    # Input for prompt and evaluation criteria
+    # Prompt and criteria inputs
     prompt_input = gr.Textbox(label="Enter Prompt", placeholder="Enter the prompt here...", lines=3)
-    criteria_dropdown = gr.Dropdown(
-        label="Select Evaluation Criteria",
+    criteria_dropdown = gr.CheckboxGroup(
+        label="Select Up to 3 Evaluation Criteria",
         choices=["Clarity", "Completeness", "Accuracy", "Relevance", "User-Friendliness", "Depth", "Creativity"],
-        value="Clarity",
-        type="value"
+        value=["Clarity"],
+        max_choices=3
     )
 
-    # Button to evaluate responses
+    # Button and outputs
    evaluate_button = gr.Button("Evaluate Models")
-
-    # Output for evaluation results
     evaluation_output = gr.Textbox(
         label="Evaluation Results",
         placeholder="The evaluation results will appear here...",
-        lines=10,
+        lines=20,
         interactive=False
     )
 
-    # Link the evaluation function to the button
+    # Link evaluation function
     evaluate_button.click(
         fn=evaluate_responses,
         inputs=[prompt_input, repo_a_input, model_a_input, repo_b_input, model_b_input, criteria_dropdown],
         outputs=[evaluation_output]
     )
 
-# Launch the Gradio app
+# Launch app
 if __name__ == "__main__":
-    demo.launch()  # Add share=True to create a public link
+    demo.launch()
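The functional change in this commit is the move from a single-criterion `gr.Dropdown` to a multi-select `gr.CheckboxGroup`, so `evaluate_responses` now receives a list of labels rather than one string and joins them into the judge prompt. Below is a minimal standalone sketch of that data flow; the selected criteria and the placeholder responses are illustrative, not output from the app.

```python
# gr.CheckboxGroup passes the selected labels to the callback as a list;
# the previous gr.Dropdown passed a single string such as "Clarity".
evaluation_criteria = ["Clarity", "Accuracy", "Depth"]  # example selection
criteria_list = ", ".join(evaluation_criteria)          # "Clarity, Accuracy, Depth"

# Placeholder responses standing in for the two models' outputs.
response_a = "The Turing Test asks whether a machine's replies are indistinguishable from a human's."
response_b = "It is a benchmark for machine intelligence proposed by Alan Turing in 1950."

evaluation_prompt = f"""
Prompt: Explain the significance of the Turing Test in artificial intelligence.

Response A: {response_a}
Response B: {response_b}

Evaluation Criteria: {criteria_list}

Please evaluate the responses based on the selected criteria. For each criterion, rate both responses on a scale from 1 to 10 and provide a justification. Finally, declare the winner (or 'draw' if they are equal).
"""

# This string is what lora_model.create_completion() receives in evaluate_responses.
print(evaluation_prompt)
```

Joining the labels keeps the judge prompt format identical whether one criterion or several are selected.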