# ModernBERT Medical Safety Classifier

The ModernBERT Medical Safety Classifier is a transformer-based language model fine-tuned to assess the safety and ethical standards of medical texts.
It was designed to distill the safety and ethical insights of Llama 3.1 (70B) into a significantly smaller and faster classifier. Specifically, it was trained on a newly curated, balanced subset of The Blue Scrubs dataset (a total of 83,636 documents), each annotated by Llama 3.1 (70B) for safety and ethical adherence. By transferring these large-model evaluations into ModernBERT, the resulting classifier retains robust predictive accuracy while remaining lightweight enough for real-time or resource-constrained inference.
## Model Details
- **Developed by**: TheBlueScrubs
## Training Data

The model was re-trained on a **new, balanced subset** drawn from The Blue Scrubs dataset to address the overrepresentation of high-safety texts. Specifically:

- We scanned a total of 11,500,608 rows across all files and removed 112,330 rows for parse/NaN/0/out-of-range issues, leaving 11,388,278 valid rows.
- Of these valid rows, 41,818 had a safety score ≤ 2, while 11,346,460 had a safety score > 2.
- To balance the dataset, we randomly sampled documents so that unsafe (≤ 2) and safer (> 2) texts were equally represented. This yielded a final balanced set of **83,636 total rows**.

Each row retained its original continuous safety score from Llama 3.1 (70B), ranging from 1 (least safe) to 5 (most safe). These scores again served as regression targets during training.
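As a toy illustration of the filtering and balancing steps above (this is not TheBlueScrubs' actual pipeline; the `balance_by_safety` helper and the `(text, score)` row format are assumptions made here for the sketch):

```python
import math
import random

def balance_by_safety(rows, threshold=2.0, seed=0):
    """Drop invalid rows, then downsample so that unsafe (score <= threshold)
    and safer (score > threshold) rows are equally represented.

    `rows` is a list of (text, safety_score) pairs; NaN scores and scores
    outside [1, 5] (including 0) are treated as invalid, mirroring the
    filtering described above.
    """
    valid = [
        (text, s) for text, s in rows
        if isinstance(s, (int, float)) and not math.isnan(s) and 1.0 <= s <= 5.0
    ]
    unsafe = [r for r in valid if r[1] <= threshold]
    safer = [r for r in valid if r[1] > threshold]
    n = min(len(unsafe), len(safer))  # the minority class caps both sides
    rng = random.Random(seed)
    return rng.sample(unsafe, n) + rng.sample(safer, n)

# Toy example: 2 unsafe rows, 4 safer rows, 1 invalid row -> 4 balanced rows.
toy = [("a", 1.5), ("b", 2.0), ("c", 3.0), ("d", 4.5), ("e", 5.0),
       ("f", 3.5), ("g", float("nan"))]
balanced = balance_by_safety(toy)
```

On the real data this is what takes 41,818 unsafe and 11,346,460 safer rows down to 41,818 of each, i.e. 83,636 in total.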
## Training Procedure

The model was fine-tuned on approximately 9 billion tokens of cancer-specific texts.

Texts were tokenized using the ModernBERT tokenizer with a maximum sequence length of 4,096 tokens. No additional filtering was applied, as the data was considered trustworthy.
### Training Hyperparameters

- **Learning Rate**: 2e-5
- **Number of Epochs**: 5
- **Batch Size**: 20 (per device)
- **Gradient Accumulation Steps**: 8
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **FP16 Training**: Enabled
- **Total Training Steps**: ~5 epochs over the final balanced set

All other hyperparameter settings (e.g., batch size, optimizer choice) remained the same as in the previous training. Only the learning rate, the number of epochs, and the balanced dataset were changed.
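Assuming a single training device (the card does not state the device count), the hyperparameters above imply the following optimizer-step counts:

```python
import math

# Hyperparameters from the list above; single-device training is assumed.
per_device_batch = 20
grad_accum_steps = 8
epochs = 5
dataset_rows = 83_636  # final balanced set

effective_batch = per_device_batch * grad_accum_steps        # documents per optimizer step
steps_per_epoch = math.ceil(dataset_rows / effective_batch)
total_steps = steps_per_epoch * epochs
```

So each optimizer step consumes 160 documents, giving 523 steps per epoch and roughly 2,615 optimizer steps over the five epochs.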
## Evaluation

The model's performance was evaluated on an out-of-sample test set comprising cancer-specific texts.

### Results

- **MSE**: 0.489
- **RMSE**: 0.699
- **Accuracy**: 0.9642
- **ROC Analysis**: Demonstrated robust classification capability with high True Positive Rates and low False Positive Rates.
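For reference, metrics of this kind can be computed with a few lines of plain Python. The `threshold` used to binarize the continuous scores for accuracy is an assumption of this sketch (the card reports an accuracy figure but does not spell out the thresholding rule):

```python
import math

def regression_metrics(y_true, y_pred, threshold=2.0):
    """MSE and RMSE on the raw 1-5 safety scores, plus accuracy after
    binarizing both truth and prediction at `threshold` (unsafe <= 2)."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    correct = sum((t <= threshold) == (p <= threshold)
                  for t, p in zip(y_true, y_pred))
    return mse, rmse, correct / n

# Toy check with four scores (not the model's actual test set):
mse, rmse, acc = regression_metrics([1.0, 2.0, 4.0, 5.0], [1.5, 2.5, 4.0, 4.5])
```

The same binarized labels would also feed a standard ROC analysis (True Positive Rate vs. False Positive Rate as the decision threshold sweeps the score range).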
## Bias, Risks, and Limitations
This model was trained on a curated subset of The Blue Scrubs dataset encompassing various medical domains, yet some areas may remain underrepresented. As with any model, there is a risk of bias stemming from data composition, and users should exercise caution when applying the classifier, especially in highly specialized contexts. Outputs should always be corroborated with expert opinion and current clinical guidelines to ensure safe, accurate medical usage.
## Recommendations