dsrestrepo committed on
Commit
9c9f50c
1 Parent(s): 63935a7

Create README.md

  1. README.md +55 -0
README.md ADDED
@@ -0,0 +1,55 @@
+ # Model Details
+
+ #### Model Name: NumericBERT
+
+ #### Model Type: Transformer
+
+ #### Architecture: BERT
+
+ #### Training Method: Masked Language Modeling (MLM)
+
+ #### Training Data: MIMIC-IV lab values data
+
+ #### Training Hyperparameters:
+
+ - **Optimizer:** AdamW
+ - **Learning Rate:** 5e-5
+ - **Masking Rate:** 20%
+ - **Tokenization:** Custom numeric-to-text mapping using the `TextEncoder` class
+
+ ### Text Encoding Process
+
+ **Overview:** Non-negative integers are converted into uppercase letter-based representations, so that numerical values can be expressed as sequences of letters.
+
+ **Normalization and Binning:**
+ - **Method:** Values are log-normalized and then split into 10 bins.
+ - **Representation:** Each bin is represented by a single letter (A-J).
+
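The normalization and binning step can be sketched as follows. This is a minimal illustration, not the `TextEncoder` implementation: the README does not say which logarithm, scaling, or bin edges are used, so the `log1p` transform, the min-max scaling, and the equal-width bins below are all assumptions, and the function names are hypothetical.

```python
import math
import string

N_BINS = 10
BIN_LETTERS = string.ascii_uppercase[:N_BINS]  # "ABCDEFGHIJ"


def log_normalize(values):
    """Scale non-negative values to [0, 1] with log1p plus min-max scaling.

    Assumption: the README only says "log normalization"; this exact
    scheme is illustrative.
    """
    logs = [math.log1p(v) for v in values]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        # All values identical: place everything in the first bin.
        return [0.0 for _ in logs]
    return [(x - lo) / (hi - lo) for x in logs]


def to_bin_letters(values):
    """Assign each value to one of 10 equal-width bins, labelled A-J."""
    letters = []
    for x in log_normalize(values):
        # Clamp so that x == 1.0 falls into the last bin rather than bin 10.
        idx = min(int(x * N_BINS), N_BINS - 1)
        letters.append(BIN_LETTERS[idx])
    return letters


if __name__ == "__main__":
    print(to_bin_letters([0, 1, 10, 100, 1000]))
```

With this scheme the smallest value in a column always maps to `A` and the largest to `J`; a different choice of bin edges (for example quantile-based bins) would distribute the letters differently.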
+ ### Token Construction
+
+ - **Format:** `<<lab_id_token>><<lab_value_bin>>`
+ - **Example:** For a lab value of type 'Bic' whose normalized value falls in bin 'C', the token would be `BicC`.
+ - **Columns Used:** 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.
+
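As a rough sketch of the token format, the snippet below concatenates a lab identifier with its bin letter. The helper names `make_token` and `encode_row` are hypothetical, and joining the tokens of one record with spaces is an assumption; the README specifies only the per-token format and the list of columns.

```python
LAB_COLUMNS = ["Bic", "Crt", "Pot", "Sod", "Ure", "Hgb", "Plt", "Wbc"]
BIN_LETTERS = "ABCDEFGHIJ"


def make_token(lab_id, bin_letter):
    """Build one token in the form <lab_id><bin_letter>, e.g. 'BicC'."""
    if lab_id not in LAB_COLUMNS:
        raise ValueError(f"unknown lab id: {lab_id!r}")
    if len(bin_letter) != 1 or bin_letter not in BIN_LETTERS:
        raise ValueError(f"bin must be a single letter A-J, got {bin_letter!r}")
    return f"{lab_id}{bin_letter}"


def encode_row(row):
    """Turn a {lab_id: bin_letter} mapping into a space-separated string.

    Assumption: tokens are emitted in LAB_COLUMNS order and joined with
    spaces; labs missing from the row are skipped.
    """
    return " ".join(make_token(lab, row[lab]) for lab in LAB_COLUMNS if lab in row)


if __name__ == "__main__":
    print(make_token("Bic", "C"))
    print(encode_row({"Sod": "A", "Bic": "C"}))
```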
+ ### Training Data Preprocessing
+
+ - **Column Selection:** Numerical values are taken from the selected lab value columns.
+ - **Text Encoding:** Numeric values are encoded into text using the process described above.
+ - **Masking:** 20% of the data is randomly masked during training.
+
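The masking step might look like the sketch below, which hides 20% of the tokens in a sequence and keeps the originals as prediction targets. It is illustrative only: the `[MASK]` string, the choice to mask a fixed fraction per sequence rather than each token independently, and the function name `mask_tokens` are assumptions not stated in the README.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol, as in standard BERT vocabularies


def mask_tokens(tokens, mask_rate=0.2, seed=None):
    """Randomly replace a fraction of tokens with MASK_TOKEN.

    Returns (masked_tokens, labels), where labels holds the original
    token at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    # Mask at least one token in any non-empty sequence.
    n_mask = max(1, round(len(tokens) * mask_rate)) if tokens else 0
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    tokens = ["BicC", "CrtA", "PotD", "SodE", "UreB",
              "HgbF", "PltG", "WbcC", "BicD", "CrtB"]
    masked, labels = mask_tokens(tokens, seed=0)
    print(masked)
    print(labels)
```

In a Hugging Face training setup, a collator such as `DataCollatorForLanguageModeling` with `mlm_probability=0.2` could play this role; the README does not say whether the original training used it.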
+ ### Model Output
+
+ - **Description:** During training, the model outputs predictions for the masked values.
+ - **Format:** The output is encoded text representing the predicted lab values.
+
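To read a predicted token back, one option is to split it into its lab identifier and bin index, as sketched below. The function name `decode_token` is hypothetical, and the sketch assumes that every token ends in exactly one bin letter, which matches the `BicC` example but is not spelled out for the model's full vocabulary.

```python
LAB_COLUMNS = ["Bic", "Crt", "Pot", "Sod", "Ure", "Hgb", "Plt", "Wbc"]
BIN_LETTERS = "ABCDEFGHIJ"


def decode_token(token):
    """Split a token such as 'BicC' into (lab_id, bin_index).

    The bin index is 0-based: 'A' -> 0, ..., 'J' -> 9.
    """
    if len(token) < 2:
        raise ValueError(f"token too short: {token!r}")
    lab_id, bin_letter = token[:-1], token[-1]
    if lab_id not in LAB_COLUMNS or bin_letter not in BIN_LETTERS:
        raise ValueError(f"not a recognised lab token: {token!r}")
    return lab_id, BIN_LETTERS.index(bin_letter)


if __name__ == "__main__":
    print(decode_token("BicC"))
```

The decoded bin index identifies only a range of values; recovering an approximate numeric value would additionally require the bin edges used during encoding.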
+ ### Limitations and Considerations
+
+ - **Numeric Data Representation:** Binning each value into one of 10 letters is coarse, so the custom text representation may not capture the finer detail of the original numeric data.
+ - **Training Data Source:** Performance may be influenced by the characteristics and biases inherent in the MIMIC-IV dataset.
+ - **Generalizability:** The model's effectiveness outside the context of the training dataset is not guaranteed.
+
+ ### Contact Information
+
+ - **Email:** davidres@mit.edu
+ - **Name:** David Restrepo
+ - **Affiliation:** MIT Critical Data - MIT