sunitha98 committed
Commit 9624e27
1 Parent(s): 58e37b8

update tasks

Files changed (1):
  1. src/display/about.py  +20 -12
src/display/about.py CHANGED
@@ -16,10 +16,10 @@ class Tasks(Enum):

     task0 = Task("finance_bench", "accuracy", "FinanceBench")
     task1 = Task("legal_confidentiality", "accuracy", "Legal Confidentiality")
-    task2 = Task("writing-prompts", "coherence", "Writing Prompts")
-    task3 = Task("customer-support", "engagement", "Customer Support Dialogue")
-    task4 = Task("toxic-prompts", "toxicity", "Toxic Prompts")
-    task5 = Task("enterprise-pii", "accuracy", "Enterprise PII")
+    task2 = Task("writing_prompts", "coherence", "Writing Prompts")
+    task3 = Task("customer_support", "engagement", "Customer Support Dialogue")
+    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
+    task5 = Task("enterprise_pii", "accuracy", "Enterprise PII")


     # Your leaderboard name
@@ -35,20 +35,28 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works

 ## Tasks
-1. FinanceBench: The task measures the ability to answer financial questions given the context.
+1. FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): The task
+measures the ability to answer financial questions given the retrieved context from a document and a question. We do
+not evaluate the retrieval capabilities for this task. We evaluate the accuracy of the answers. The dataset can be
+found at https://huggingface.co/datasets/PatronusAI/financebench.

-2. Legal Confidentiality: The task measures the ability of LLMs to reason over legal clauses. The model is prompted
-to return yes/no as an answer to the question.
+2. Legal Confidentiality: We use a subset of 100 labelled prompts from LegalBench (Guha, et al. "LegalBench: A
+Collaboratively Built Benchmark for Measuring Legal Reasoning in \
+Large Language Models") to measure the ability of LLMs to reason over legal clauses. The model is prompted to return \
+yes/no as an answer to the question.

-3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM.
+3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the
+engagingness of the text generated by the LLM.

 4. Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question
-given some product information and conversational history.
+given some product information and conversational history. We measure the relevance of the generation given the
+conversational history, product information and the question from the customer.

-5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information
-from LLMs.
+5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from
+LLMs. We measure whether the model generates toxic content.

-6. Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit business-sensitive information from LLMs.
+6. Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit
+business-sensitive information from LLMs. We measure whether the model generates business-sensitive information.

 ## Reproducibility
 All of our datasets are closed-source. We provide a validation set with 5 examples for each of the tasks.
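
For context, here is a minimal sketch of how entries like these are typically consumed in the Hugging Face leaderboard template this file appears to follow. The Task dataclass definition and the results lookup below are assumptions (neither is shown in this diff); the sketch illustrates that the benchmark field acts as a lookup key, which is presumably why this commit switches the keys from hyphenated names to the underscore style used by the first two tasks.

    from dataclasses import dataclass
    from enum import Enum

    # Assumed definition of Task (not part of this diff): the standard
    # leaderboard template pairs a result-file key, a metric key, and a
    # display column name.
    @dataclass
    class Task:
        benchmark: str  # key identifying the task in a model's result file
        metric: str     # metric name stored under that key
        col_name: str   # column header displayed on the leaderboard

    class Tasks(Enum):
        # State of the enum after this commit.
        task0 = Task("finance_bench", "accuracy", "FinanceBench")
        task1 = Task("legal_confidentiality", "accuracy", "Legal Confidentiality")
        task2 = Task("writing_prompts", "coherence", "Writing Prompts")
        task3 = Task("customer_support", "engagement", "Customer Support Dialogue")
        task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
        task5 = Task("enterprise_pii", "accuracy", "Enterprise PII")

    # Hypothetical result payload, keyed by the underscore names above:
    results = {"writing_prompts": {"coherence": 0.81}}
    for task in Tasks:
        score = results.get(task.value.benchmark, {}).get(task.value.metric)
        print(f"{task.value.col_name}: {score}")

In this sketch, the old hyphenated keys ("writing-prompts", etc.) would never match an underscore-keyed result payload, so the lookups for tasks 2 through 5 would silently return nothing and those leaderboard columns would stay empty.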
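Of the six datasets, only FinanceBench is linked publicly in the updated text; the others are closed-source per the Reproducibility note. Assuming the linked dataset is ungated and loadable with its default config, it can be pulled with the datasets library:

    from datasets import load_dataset

    # Public dataset linked in the updated FinanceBench description; the
    # remaining five task datasets are closed-source.
    financebench = load_dataset("PatronusAI/financebench")
    print(financebench)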