sunitha98 committed
Commit 9624e27
1 Parent(s): 58e37b8

update tasks

Files changed (1):
  1. src/display/about.py  +20 -12
src/display/about.py CHANGED
@@ -16,10 +16,10 @@ class Tasks(Enum):

     task0 = Task("finance_bench", "accuracy", "FinanceBench")
     task1 = Task("legal_confidentiality", "accuracy", "Legal Confidentiality")
-    task2 = Task("writing-prompts", "coherence", "Writing Prompts")
-    task3 = Task("customer-support", "engagement", "Customer Support Dialogue")
-    task4 = Task("toxic-prompts", "toxicity", "Toxic Prompts")
-    task5 = Task("enterprise-pii", "accuracy", "Enterprise PII")
+    task2 = Task("writing_prompts", "coherence", "Writing Prompts")
+    task3 = Task("customer_support", "engagement", "Customer Support Dialogue")
+    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
+    task5 = Task("enterprise_pii", "accuracy", "Enterprise PII")


     # Your leaderboard name
@@ -35,20 +35,28 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works

 ## Tasks
-1. FinanceBench: The task measures the ability to answer financial questions given the context.
+1. FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): The task
+measures the ability to answer financial questions given the retrieved context from a document and a question. We do
+not evaluate the retrieval capabilities for this task. We evaluate the accuracy of the answers. The dataset can be
+found at https://huggingface.co/datasets/PatronusAI/financebench.

-2. Legal Confidentiality: The task measures the ability of LLMs to reason over legal clauses. The model is prompted
-to return yes/no as an answer to the question.
+2. Legal Confidentiality: We use a subset of 100 labelled prompts from LegalBench (Guha, et al. "LegalBench: A
+Collaboratively Built Benchmark for Measuring Legal Reasoning in \
+Large Language Models") to measure the ability of LLMs to reason over legal clauses. The model is prompted to return \
+yes/no as an answer to the question.

-3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM.
+3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the
+engagingness of the text generated by the LLM.

 4. Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question
-given some product information and conversational history.
+given some product information and conversational history. We measure the relevance of the generation given the
+conversational history, product information and the question from the customer.

-5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information
-from LLMs.
+5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from
+LLMs. We measure whether the model generates toxic content.

-6. Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit business-sensitive information from LLMs.
+6. Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit
+business-sensitive information from LLMs. We measure whether the model generates business-sensitive information.

 ## Reproducibility
 All of our datasets are closed-source. We provide a validation set with 5 examples for each of the tasks.
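
For context, here is a minimal sketch of how entries like these are typically consumed in the Hugging Face leaderboard template this file appears to follow. The Task dataclass definition and the results lookup below are assumptions (neither is shown in this diff); the sketch illustrates that the benchmark field acts as a lookup key, which is presumably why this commit switches the keys from hyphenated names to the underscore style used by the first two tasks.

    from dataclasses import dataclass
    from enum import Enum

    # Assumed definition of Task (not part of this diff): the standard
    # leaderboard template pairs a result-file key, a metric key, and a
    # display column name.
    @dataclass
    class Task:
        benchmark: str  # key identifying the task in a model's result file
        metric: str     # metric name stored under that key
        col_name: str   # column header displayed on the leaderboard

    class Tasks(Enum):
        # State of the enum after this commit.
        task0 = Task("finance_bench", "accuracy", "FinanceBench")
        task1 = Task("legal_confidentiality", "accuracy", "Legal Confidentiality")
        task2 = Task("writing_prompts", "coherence", "Writing Prompts")
        task3 = Task("customer_support", "engagement", "Customer Support Dialogue")
        task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
        task5 = Task("enterprise_pii", "accuracy", "Enterprise PII")

    # Hypothetical result payload, keyed by the underscore names above:
    results = {"writing_prompts": {"coherence": 0.81}}
    for task in Tasks:
        score = results.get(task.value.benchmark, {}).get(task.value.metric)
        print(f"{task.value.col_name}: {score}")

In this sketch, the old hyphenated keys ("writing-prompts", etc.) would never match an underscore-keyed result payload, so the lookups for tasks 2 through 5 would silently return nothing and those leaderboard columns would stay empty.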
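Of the six datasets, only FinanceBench is linked publicly in the updated text; the others are closed-source per the Reproducibility note. Assuming the linked dataset is ungated and loadable with its default config, it can be pulled with the datasets library:

    from datasets import load_dataset

    # Public dataset linked in the updated FinanceBench description; the
    # remaining five task datasets are closed-source.
    financebench = load_dataset("PatronusAI/financebench")
    print(financebench)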