kaikaidai committed
Commit f3cb34b · verified · 1 Parent(s): 47e4bdb

Update common.py

Files changed (1):
  1. common.py +31 -10

common.py CHANGED
@@ -47,18 +47,34 @@ EVAL_DESCRIPTION = """
 - Examples (Optional)
 """
 
-DEFAULT_EVAL_PROMPT = """You are assessing a chat bot response to a user's input based on how well it follows the user's instructions. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Do not allow the length of the response to influence your evaluation. Be objective as possible and give a brief explanation for your score.
+DEFAULT_EVAL_PROMPT = """You are assessing a chat bot response to a user's input. Your evaluation should focus on the helpfulness of the response given the user's instructions. Do not allow the length of the response to influence your evaluation. Be objective as possible and give a brief explanation for your score.
 
-Score:
-Score 1: The response ignores or misinterprets instructions, providing irrelevant or inaccurate content that fails to address the request.
-Score 2: The response follows instructions partially but misses key elements, lacking depth or precision while containing minor inaccuracies.
-Score 3: The response follows main instructions adequately, providing correct and relevant information with reasonable depth.
-Score 4: The response follows instructions thoroughly with strong attention to detail, offering accurate, well-developed content that thoughtfully addresses needs.
-Score 5: The response demonstrates exceptional instruction following with precise, comprehensive content that shows both insight and perfect alignment with the request.
+Scoring Rubric:
+Score 1: The response is unhelpful, providing irrelevant or incorrect content that does not address the request.
+Score 2: The response is partially helpful, missing key elements or including minor inaccuracies, and lacks depth in addressing the request.
+Score 3: The response is adequately helpful, correctly addressing the main request with relevant information and some depth.
+Score 4: The response is very helpful, addressing the request thoroughly with accurate and detailed content, but may lack a minor aspect of helpfulness.
+Score 5: The response is exceptionally helpful, providing precise, comprehensive content that fully resolves the request with insight and clarity.
 
 [User Query]: {{input}}
 
-[Response]: {{response}}"""
+[AI Response]: {{response}}"""
+
+# Split the eval prompt into editable and fixed parts
+DEFAULT_EVAL_PROMPT_EDITABLE = """You are assessing a chat bot response to a user's input. Your evaluation should focus on the helpfulness of the response given the user's instructions. Do not allow the length of the response to influence your evaluation. Be objective as possible and give a brief explanation for your score.
+
+Scoring Rubric:
+Score 1: The response is unhelpful, providing irrelevant or incorrect content that does not address the request.
+Score 2: The response is partially helpful, missing key elements or including minor inaccuracies, and lacks depth in addressing the request.
+Score 3: The response is adequately helpful, correctly addressing the main request with relevant information and some depth.
+Score 4: The response is very helpful, addressing the request thoroughly with accurate and detailed content, but may lack a minor aspect of helpfulness.
+Score 5: The response is exceptionally helpful, providing precise, comprehensive content that fully resolves the request with insight and clarity."""
+
+# Fixed suffix that will always be appended
+FIXED_EVAL_SUFFIX = """
+[User Query]: {{input}}
+
+[AI Response]: {{response}}"""
 
 # Default Variable Values
 DEFAULT_INPUT = """Which of these animals is least likely to be found in a rainforest?"
@@ -127,19 +143,24 @@ Judge Arena is specifically designed to assess AI models that function as evaluators
 # FAQ
 
 **Isn't this the same as Chatbot Arena?**
+
 We are big fans of what the LMSYS team have done with Chatbot Arena and fully credit them for the inspiration to develop this. We were looking for a dynamic leaderboard that graded on AI judge capabilities and didn't manage to find one, so we created Judge Arena. This UI is designed especially for evals; to match the format of the model-based eval prompts that you would use in your LLM evaluation / monitoring tool.
 
 **Why should I trust this leaderboard?**
-We have listed out our efforts to be fully transparent in the policies above. All of the code for this leaderboard is open-source and can be found on our [Github](https://github.com/atla-ai/judge-arena).
+
+We have listed out our efforts to be fully transparent in the policies above. All of the code for this leaderboard is open-source and can be found on our [Github](https://github.com/atla-ai/judge-arena). Check out our [blog](https://www.atla-ai.com/blog) to stay up to date as we analyse the results from the leaderboard.
 
 **Who funds this effort?**
+
 Atla currently funds this out of our own pocket. We are looking for API credits (with no strings attached) to support this effort - please get in touch if you or someone you know might be able to help.
 
 **What is Atla working on?**
+
 We are training a general-purpose evaluator that you will soon be able to run in this Judge Arena. Our next step will be to open-source a powerful model that the community can use to run fast and accurate evaluations.
 <br><br>
 # Get in touch
-Feel free to email us at [support@atla-ai.com](mailto:support@atla-ai.com) or leave feedback on our [Github](https://github.com/atla-ai/judge-arena)!"""
+We’d love to hear your feedback! For general feature requests or to submit / suggest new models to add to the arena, please open up a discussion in the [community](https://huggingface.co/spaces/AtlaAI/judge-arena/discussions) tab. You can also contact us directly on [X](https://x.com/Atla_AI) or [Discord](https://discord.gg/yNpUAMqs).
+\nPlease file any issues on our [Github](https://github.com/atla-ai/judge-arena)."""
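Note on the prompt split above: the commit separates the judge prompt into a user-editable rubric (DEFAULT_EVAL_PROMPT_EDITABLE) and a fixed suffix (FIXED_EVAL_SUFFIX) carrying the {{input}} and {{response}} placeholders, so edits to the rubric cannot remove the placeholders. The diff does not show how the app recombines these pieces; the sketch below is one plausible reading, assuming plain string concatenation followed by a string-replace on the placeholders. The build_judge_prompt helper and the example response text are hypothetical and not part of this commit.

# Hypothetical sketch (not part of this commit): how the split prompt
# might be reassembled before being sent to a judge model.
def build_judge_prompt(editable_part: str, user_input: str, ai_response: str) -> str:
    # FIXED_EVAL_SUFFIX is always appended after the (possibly edited) rubric.
    template = editable_part + FIXED_EVAL_SUFFIX
    # Fill the template variables with the actual query and response to judge.
    return (
        template
        .replace("{{input}}", user_input)
        .replace("{{response}}", ai_response)
    )

# Example usage with the defaults defined in common.py (the response string
# below is made up for illustration):
# prompt = build_judge_prompt(
#     DEFAULT_EVAL_PROMPT_EDITABLE,
#     DEFAULT_INPUT,
#     "A polar bear is least likely to be found in a rainforest.",
# )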