pedromoreira22 committed
Commit · 0d77329 · 1 Parent(s): 0520647
corrections
app.py CHANGED
@@ -102,7 +102,7 @@ blog_posts = [

 ### Motivation and Problem

-Language models (LLMs) like GPT are touted as game-changers in the medical field, providing support in data processing and decision-making. However, there's a significant challenge: these models struggle with the variability in drug names. Patients often use brand names (like Tylenol) instead of generic equivalents (like acetaminophen), and this can confuse LLMs, leading to decreased accuracy and potential misinformation. This is a critical issue in healthcare, where
+Language models (LLMs) like GPT are touted as game-changers in the medical field, providing support in data processing and decision-making. However, there's a significant challenge: these models struggle with the variability in drug names. Patients often use brand names (like Tylenol) instead of generic equivalents (like acetaminophen), and this can confuse LLMs, leading to decreased accuracy and potential misinformation. This is a critical issue in healthcare, where factuality is paramount.

 ### What We Did

@@ -139,22 +139,23 @@ blog_posts = [

 Check out the full RABBITS leaderboard on Hugging Face to see how different models compare!
 '''},
-{"title": "
+{"title": "DrugMatchQA task (b4bqa)",
 "content": '''

-**Exploring the
+**Exploring the DrugMatchQA Task: Uncovering Hidden Challenges in Language Models 🩺🔍**

 ### What We Did

-
+We introduced the DrugMatchQA task (b4bqa) and leveraged the Dolma dataset for detailed analysis. Here's a breakdown of our approach:

-1. **Creating the
+1. **Creating the DrugMatchQA Task (b4bqa):**
 - We developed a specialized benchmark by transforming existing medical QA datasets (MedQA and MedMCQA) to test LLMs' robustness in understanding drug name synonyms.
 - Using regular expressions, we swapped brand names with their generic equivalents and vice versa, creating two new datasets: brand-to-generic (b2g) and generic-to-brand (g2b).

 2. **Dolma Dataset Counting:**
 - We analyzed the Dolma dataset, a massive collection of 3.1 trillion tokens, to understand how frequently brand and generic drug names appear.
-
+- For the drugs found in the test sets of MedQA and MedMCQA, we found that generic names were much more common than brand names in Dolma.
+- We found medical benchmark contamination in Dolma. For instance, 99.21% of MedQA test data and 34.13% of MedMCQA test data overlapped with Dolma.

 ### Results

@@ -191,7 +192,7 @@ blog_posts = [

 ### Conclusion

-The
+The DrugMatchQA task and our Dolma dataset analysis reveal critical weaknesses in how current language models handle drug names. While LLMs offer significant potential in healthcare, they need to be more robust and accurate to ensure reliable patient support. Our findings underscore the importance of addressing dataset contamination and enhancing LLM robustness to meet the stringent demands of medical applications.
 '''}
 ]

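For readers curious how a regex-based brand↔generic swap like the b2g/g2b transformation described in the post can work, here is a minimal sketch. This is not the authors' code: the `BRAND_TO_GENERIC` mapping, the `swap_drug_names` helper, and the example question are hypothetical placeholders; the real benchmark is built from a curated drug-name list.

```python
import re

# Hypothetical brand/generic mapping for illustration only.
BRAND_TO_GENERIC = {
    "Tylenol": "acetaminophen",
    "Advil": "ibuprofen",
}
GENERIC_TO_BRAND = {generic: brand for brand, generic in BRAND_TO_GENERIC.items()}


def swap_drug_names(text, mapping):
    """Replace whole-word drug-name mentions in `text` using `mapping`."""
    lookup = {name.lower(): replacement for name, replacement in mapping.items()}
    # Longest names first so longer names are not partially matched.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in sorted(mapping, key=len, reverse=True)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


question = "A patient takes Tylenol daily for pain. Which adverse effect is most likely?"
b2g_question = swap_drug_names(question, BRAND_TO_GENERIC)      # brand -> generic
g2b_question = swap_drug_names(b2g_question, GENERIC_TO_BRAND)  # generic -> brand
print(b2g_question)
```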
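In the same spirit, a rough sketch of counting brand versus generic drug-name mentions in a corpus, as in the Dolma frequency analysis mentioned above. The `count_mentions` function, the toy `DRUG_NAMES` list, and the in-memory corpus are assumptions for illustration; scanning a 3.1-trillion-token corpus would require a distributed pipeline rather than this loop.

```python
import re
from collections import Counter

# Toy name list and in-memory corpus; real analysis would stream corpus shards.
DRUG_NAMES = ["Tylenol", "acetaminophen", "Advil", "ibuprofen"]


def count_mentions(documents, names):
    """Count case-insensitive whole-word mentions of each drug name."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE)
    counts = Counter()
    for doc in documents:
        for match in pattern.findall(doc):
            counts[match.lower()] += 1
    return counts


corpus = [
    "Acetaminophen overdose is a leading cause of acute liver failure.",
    "She took Advil; ibuprofen is the generic name.",
]
print(count_mentions(corpus, DRUG_NAMES))
# Counter({'acetaminophen': 1, 'advil': 1, 'ibuprofen': 1})
```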
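Finally, a naive illustration of the kind of benchmark-contamination check behind the MedQA/MedMCQA overlap figures: this sketch assumes simple matching of normalized test questions against corpus documents, and `normalize` and `contamination_rate` are hypothetical helpers, not the authors' method.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def contamination_rate(test_questions, corpus_documents):
    """Fraction of test questions whose normalized text appears verbatim in any document."""
    documents = [normalize(doc) for doc in corpus_documents]
    hits = sum(any(normalize(q) in doc for doc in documents) for q in test_questions)
    return hits / len(test_questions) if test_questions else 0.0


questions = ["Which drug class does lisinopril belong to?"]
documents = ["... which drug class does lisinopril belong to? ACE inhibitors ..."]
print(contamination_rate(questions, documents))  # 1.0
```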