pedromoreira22 committed
Commit · 0d77329 · 1 Parent(s): 0520647
corrections
app.py CHANGED
@@ -102,7 +102,7 @@ blog_posts = [

 ### Motivation and Problem

-Language models (LLMs) like GPT are touted as game-changers in the medical field, providing support in data processing and decision-making. However, there's a significant challenge: these models struggle with the variability in drug names. Patients often use brand names (like Tylenol) instead of generic equivalents (like acetaminophen), and this can confuse LLMs, leading to decreased accuracy and potential misinformation. This is a critical issue in healthcare, where
+Language models (LLMs) like GPT are touted as game-changers in the medical field, providing support in data processing and decision-making. However, there's a significant challenge: these models struggle with the variability in drug names. Patients often use brand names (like Tylenol) instead of generic equivalents (like acetaminophen), and this can confuse LLMs, leading to decreased accuracy and potential misinformation. This is a critical issue in healthcare, where factuality is paramount.

 ### What We Did

@@ -139,22 +139,23 @@ blog_posts = [

 Check out the full RABBITS leaderboard on Hugging Face to see how different models compare!
 '''},
-{"title": "
+{"title": "DrugMatchQA task (b4bqa)",
 "content": '''

-**Exploring the
+**Exploring the DrugMatchQA Task: Uncovering Hidden Challenges in Language Models 🩺🔍**

 ### What We Did

-
+We introduced the DrugMatchQA task (b4bqa) and leveraged the Dolma dataset for detailed analysis. Here's a breakdown of our approach:

-1. **Creating the
+1. **Creating the DrugMatchQA Task (b4bqa):**
 - We developed a specialized benchmark by transforming existing medical QA datasets (MedQA and MedMCQA) to test LLMs' robustness in understanding drug name synonyms.
 - Using regular expressions, we swapped brand names with their generic equivalents and vice versa, creating two new datasets: brand-to-generic (b2g) and generic-to-brand (g2b).

 2. **Dolma Dataset Counting:**
 - We analyzed the Dolma dataset, a massive collection of 3.1 trillion tokens, to understand how frequently brand and generic drug names appear.
-
+- For the drugs found in the test sets of MedQA and MedMCQA, we found that generic names were much more common than brand names in Dolma.
+- We found medical benchmark contamination in Dolma. For instance, 99.21% of MedQA test data and 34.13% of MedMCQA test data overlapped with Dolma.

 ### Results

@@ -191,7 +192,7 @@ blog_posts = [

 ### Conclusion

-The
+The DrugMatchQA task and our Dolma dataset analysis reveal critical weaknesses in how current language models handle drug names. While LLMs offer significant potential in healthcare, they need to be more robust and accurate to ensure reliable patient support. Our findings underscore the importance of addressing dataset contamination and enhancing LLM robustness to meet the stringent demands of medical applications.
 '''}
 ]

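For readers curious how a regex-based brand↔generic swap like the b2g/g2b transformation described in the post can work, here is a minimal sketch. This is not the authors' code: the `BRAND_TO_GENERIC` mapping, the `swap_drug_names` helper, and the example question are hypothetical placeholders; the real benchmark is built from a curated drug-name list.

```python
import re

# Hypothetical brand/generic mapping for illustration only.
BRAND_TO_GENERIC = {
    "Tylenol": "acetaminophen",
    "Advil": "ibuprofen",
}
GENERIC_TO_BRAND = {generic: brand for brand, generic in BRAND_TO_GENERIC.items()}


def swap_drug_names(text, mapping):
    """Replace whole-word drug-name mentions in `text` using `mapping`."""
    lookup = {name.lower(): replacement for name, replacement in mapping.items()}
    # Longest names first so longer names are not partially matched.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in sorted(mapping, key=len, reverse=True)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


question = "A patient takes Tylenol daily for pain. Which adverse effect is most likely?"
b2g_question = swap_drug_names(question, BRAND_TO_GENERIC)      # brand -> generic
g2b_question = swap_drug_names(b2g_question, GENERIC_TO_BRAND)  # generic -> brand
print(b2g_question)
```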
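In the same spirit, a rough sketch of counting brand versus generic drug-name mentions in a corpus, as in the Dolma frequency analysis mentioned above. The `count_mentions` function, the toy `DRUG_NAMES` list, and the in-memory corpus are assumptions for illustration; scanning a 3.1-trillion-token corpus would require a distributed pipeline rather than this loop.

```python
import re
from collections import Counter

# Toy name list and in-memory corpus; real analysis would stream corpus shards.
DRUG_NAMES = ["Tylenol", "acetaminophen", "Advil", "ibuprofen"]


def count_mentions(documents, names):
    """Count case-insensitive whole-word mentions of each drug name."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE)
    counts = Counter()
    for doc in documents:
        for match in pattern.findall(doc):
            counts[match.lower()] += 1
    return counts


corpus = [
    "Acetaminophen overdose is a leading cause of acute liver failure.",
    "She took Advil; ibuprofen is the generic name.",
]
print(count_mentions(corpus, DRUG_NAMES))
# Counter({'acetaminophen': 1, 'advil': 1, 'ibuprofen': 1})
```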
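Finally, a naive illustration of the kind of benchmark-contamination check behind the MedQA/MedMCQA overlap figures: this sketch assumes simple matching of normalized test questions against corpus documents, and `normalize` and `contamination_rate` are hypothetical helpers, not the authors' method.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def contamination_rate(test_questions, corpus_documents):
    """Fraction of test questions whose normalized text appears verbatim in any document."""
    documents = [normalize(doc) for doc in corpus_documents]
    hits = sum(any(normalize(q) in doc for doc in documents) for q in test_questions)
    return hits / len(test_questions) if test_questions else 0.0


questions = ["Which drug class does lisinopril belong to?"]
documents = ["... which drug class does lisinopril belong to? ACE inhibitors ..."]
print(contamination_rate(questions, documents))  # 1.0
```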