sunitha98 committed on
Commit c4e3e53
Parent: ab74ccc

update about

Files changed (1):
  src/display/about.py  +7 -7
src/display/about.py CHANGED
@@ -17,7 +17,7 @@ class Tasks(Enum):
     task0 = Task("finance_bench", "accuracy", "FinanceBench")
     task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
     task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
-    # task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")
+    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")
     task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
     task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")
 
@@ -35,19 +35,19 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works
 
 ## Tasks
-1. FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): The task measures the ability to answer financial questions given the retrieved context from a document and a question. We do not evaluate the retrieval capabilities for this task. We only evaluate the accuracy of the answers.The dataset can be
+1.FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): The task measures the ability to answer financial questions given the retrieved context from a document and a question. We do not evaluate the retrieval capabilities for this task. We only evaluate the accuracy of the answers.The dataset can be
 found at https://huggingface.co/datasets/PatronusAI/financebench.
 
-2. Legal Confidentiality: We use a subset of 100 labeled prompts from LegalBench (Guha, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in \
+2.Legal Confidentiality: We use a subset of 100 labeled prompts from LegalBench (Guha, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in \
 Large Language Models) to measure the ability of LLMs to reason over legal causes. The model is prompted to return yes/no as an answer to the question.
 
-3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the engagingness of the text generated by the LLM. The dataset is a mix of human annotated samples from r/WritingPrompts and redteaming generations.
+3.Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the engagingness of the text generated by the LLM. The dataset is a mix of human annotated samples from r/WritingPrompts and redteaming generations.
 
-4. Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question given some product information and conversational history. We measure the relevance of the generation given the conversational history, product information and question by the customer.
+4.Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question given some product information and conversational history. We measure the relevance of the generation given the conversational history, product information and question by the customer.
 
-5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from LLMs. We measure if the model generates toxic content.
+5.Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from LLMs. We measure if the model generates toxic content.
 
-6. Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit business-sensitive information from LLMs. We measure if the model generates business sensitive information.
+6.Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit business-sensitive information from LLMs. We measure if the model generates business sensitive information.
 
 ## What is Patronus AI?
 
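
For reference, this is how the `Tasks` enum reads once the commit is applied. The `Task` dataclass fields shown here (`benchmark`, `metric`, `col_name`) are assumed from the common Hugging Face leaderboard template, since the dataclass definition is not part of this diff; only the enum members are taken verbatim from the change.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    # Assumed field layout (standard leaderboard template); not shown in this diff.
    benchmark: str  # key used to look up scores in the results files
    metric: str     # metric reported for the task
    col_name: str   # column header displayed on the leaderboard


class Tasks(Enum):
    # State after this commit: task3 (Customer Support Dialogue) is now enabled.
    task0 = Task("finance_bench", "accuracy", "FinanceBench")
    task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
    task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")
    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
    task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")


if __name__ == "__main__":
    # Example: list the leaderboard columns and their metrics.
    for task in Tasks:
        print(f"{task.value.col_name}: {task.value.metric}")
```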
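
The Legal Confidentiality task scores the model's yes/no answer with `exact_match`. A minimal, purely illustrative sketch of such a scorer (not necessarily the evaluator the leaderboard actually uses) could look like this:

```python
import re


def normalize(answer: str) -> str:
    """Lowercase and strip everything but letters so 'Yes.' and 'yes' compare equal."""
    return re.sub(r"[^a-z]", "", answer.lower())


def exact_match(prediction: str, label: str) -> bool:
    """True when the normalized prediction equals the normalized label."""
    return normalize(prediction) == normalize(label)


# Toy usage: the verbose third answer fails strict exact match by design.
predictions = ["Yes.", "no", "Yes, it does."]
labels = ["yes", "no", "yes"]
score = sum(exact_match(p, l) for p, l in zip(predictions, labels)) / len(labels)
print(f"exact_match: {score:.2f}")  # 0.67
```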