pedromoreira22 committed on
Commit
0d77329
·
1 Parent(s): 0520647

corrections

Files changed (1)
  1. app.py +8 -7
app.py CHANGED
@@ -102,7 +102,7 @@ blog_posts = [
102
 
103
  ### Motivation and Problem
104
 
105
- Language models (LLMs) like GPT are touted as game-changers in the medical field, providing support in data processing and decision-making. However, there's a significant challenge: these models struggle with the variability in drug names. Patients often use brand names (like Tylenol) instead of generic equivalents (like acetaminophen), and this can confuse LLMs, leading to decreased accuracy and potential misinformation. This is a critical issue in healthcare, where precision is paramount.
106
 
107
  ### What We Did
108
 
@@ -139,22 +139,23 @@ blog_posts = [
139
 
140
  Check out the full RABBITS leaderboard on Hugging Face to see how different models compare!
141
  '''},
142
- {"title": "DrugMathQA task (b4bqa)",
143
  "content": '''
144
 
145
- **Exploring the DrugMathQA Task: Uncovering Hidden Challenges in Language Models 🩺📊**
146
 
147
  ### What We Did
148
 
149
- Wwe introduced the DrugMathQA task (b4bqa) and leveraged the Dolma dataset for detailed analysis. Here's a breakdown of our approach:
150
 
151
- 1. **Creating the DrugMathQA Task (b4bqa):**
152
  - We developed a specialized benchmark by transforming existing medical QA datasets (MedQA and MedMCQA) to test LLMs' robustness in understanding drug name synonyms.
153
  - Using regular expressions, we swapped brand names with their generic equivalents and vice versa, creating two new datasets: brand-to-generic (b2g) and generic-to-brand (g2b).
154
 
155
  2. **Dolma Dataset Counting:**
156
  - We analyzed the Dolma dataset, a massive collection of 3.1 trillion tokens, to understand how frequently brand and generic drug names appear.
157
- - We identified overlaps between drug names in Dolma and the test sets of MedQA and MedMCQA, revealing significant contamination. For instance, 99.21% of MedQA test data and 34.13% of MedMCQA test data overlapped with Dolma.
 
158
 
159
  ### Results
160
 
@@ -191,7 +192,7 @@ blog_posts = [
191
 
192
  ### Conclusion
193
 
194
- The DrugMathQA task and our Dolma dataset analysis reveal critical weaknesses in how current language models handle drug names. While LLMs offer significant potential in healthcare, they need to be more robust and accurate to ensure reliable patient support. Our findings underscore the importance of addressing dataset contamination and enhancing LLM robustness to meet the stringent demands of medical applications.
195
  '''}
196
  ]
197
 
 
102
 
103
  ### Motivation and Problem
104
 
105
+ Language models (LLMs) like GPT are touted as game-changers in the medical field, providing support in data processing and decision-making. However, there's a significant challenge: these models struggle with the variability in drug names. Patients often use brand names (like Tylenol) instead of generic equivalents (like acetaminophen), and this can confuse LLMs, leading to decreased accuracy and potential misinformation. This is a critical issue in healthcare, where factuality is paramount.
106
 
107
  ### What We Did
108
 
 
139
 
140
  Check out the full RABBITS leaderboard on Hugging Face to see how different models compare!
141
  '''},
142
+ {"title": "DrugMatchQA task (b4bqa)",
143
  "content": '''
144
 
145
+ **Exploring the DrugMatchQA Task: Uncovering Hidden Challenges in Language Models 🩺📊**
146
 
147
  ### What We Did
148
 
149
+ We introduced the DrugMatchQA task (b4bqa) and leveraged the Dolma dataset for detailed analysis. Here's a breakdown of our approach:
150
 
151
+ 1. **Creating the DrugMatchQA Task (b4bqa):**
152
  - We developed a specialized benchmark by transforming existing medical QA datasets (MedQA and MedMCQA) to test LLMs' robustness in understanding drug name synonyms.
153
  - Using regular expressions, we swapped brand names with their generic equivalents and vice versa, creating two new datasets: brand-to-generic (b2g) and generic-to-brand (g2b).
154
 
155
  2. **Dolma Dataset Counting:**
156
  - We analyzed the Dolma dataset, a massive collection of 3.1 trillion tokens, to understand how frequently brand and generic drug names appear.
157
+ - For the drugs found in the test sets of MedQA and MedMCQA, we found that generic names were much more common than brand names in Dolma.
158
+ - We found medical benchmark contamination in Dolma. For instance, 99.21% of MedQA test data and 34.13% of MedMCQA test data overlapped with Dolma.
159
 
160
  ### Results
161
 
 
192
 
193
  ### Conclusion
194
 
195
+ The DrugMatchQA task and our Dolma dataset analysis reveal critical weaknesses in how current language models handle drug names. While LLMs offer significant potential in healthcare, they need to be more robust and accurate to ensure reliable patient support. Our findings underscore the importance of addressing dataset contamination and enhancing LLM robustness to meet the stringent demands of medical applications.
196
  '''}
197
  ]
198
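
Note on the regex swap described in the post's "Creating the DrugMatchQA Task (b4bqa)" step: the diff only states that brand and generic names were swapped with regular expressions, so the following is a minimal sketch of how such a swap could look, not the authors' implementation. The mapping `BRAND_TO_GENERIC`, the helper `build_swapper`, and the choice of case-insensitive whole-word matching are all assumptions made for illustration.

```python
import re

# Illustrative mapping only (assumption): the benchmark's real drug list
# is not part of this commit.
BRAND_TO_GENERIC = {
    "Tylenol": "acetaminophen",
    "Advil": "ibuprofen",
}


def build_swapper(mapping):
    """Return a function that replaces whole-word keys of `mapping` with their values."""
    # Lower-cased lookup so matches are resolved case-insensitively.
    lookup = {name.lower(): repl for name, repl in mapping.items()}
    # Try longer names first so a multi-word name wins over its own prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        flags=re.IGNORECASE,
    )

    def swap(text):
        return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

    return swap


# Two directions, mirroring the b2g and g2b datasets named in the post.
brand_to_generic = build_swapper(BRAND_TO_GENERIC)
generic_to_brand = build_swapper({g: b for b, g in BRAND_TO_GENERIC.items()})
```

Calling `brand_to_generic("Patient took Tylenol daily.")` rewrites the brand name in place and leaves the rest of the question untouched. The word boundaries keep the pattern from firing inside longer tokens, which matters when a drug name is a substring of another word.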