# Model Details
#### Model Name: NumericBERT
#### Model Type: Transformer
#### Architecture: BERT
#### Training Method: Masked Language Modeling (MLM)
#### Training Data: MIMIC-IV lab values
#### Training Hyperparameters:
- **Optimizer:** AdamW
- **Learning Rate:** 5e-5
- **Masking Rate:** 20%
- **Tokenization:** Custom numeric-to-text mapping via the `TextEncoder` class (see the training sketch below)
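
The training script is not included in this repository, so the following is only a minimal sketch of how these hyperparameters could be wired together with the Hugging Face `transformers` library. The base checkpoint and tokenizer are placeholders: the actual model uses a custom vocabulary built from `TextEncoder` tokens.

```python
from torch.optim import AdamW
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

# Placeholders: the real model uses a custom vocabulary of TextEncoder
# tokens (e.g. "BicC"), not the stock BERT tokenizer or weights.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# AdamW with the learning rate listed above.
optimizer = AdamW(model.parameters(), lr=5e-5)

# Randomly masks 20% of input tokens, matching the masking rate above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
```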
### Text Encoding Process
**Overview:** Lab values are normalized and binned, and the resulting non-negative integer bin indices are converted into uppercase letters, allowing numerical values to be expressed as sequences of letters.
**Normalization and Binning:**
- **Method:** Log normalization and splitting into 10 bins.
- **Representation:** Each bin is represented by an uppercase letter (A-J); see the sketch after this list.
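
The `TextEncoder` implementation itself is not reproduced in this card, so here is a minimal sketch of the described scheme under stated assumptions: `log1p` is assumed for the log normalization, and the 10 bins are assumed to be equal-width over the log-transformed training range.

```python
import numpy as np

def encode_value(value: float, train_values: np.ndarray) -> str:
    """Map a non-negative lab value to one of ten letter bins A-J.

    Assumptions: log1p normalization and 10 equal-width bins over the
    log-transformed training distribution; the real TextEncoder may
    differ in its normalization and bin-edge handling.
    """
    log_train = np.log1p(train_values)
    edges = np.linspace(log_train.min(), log_train.max(), num=11)
    bin_idx = int(np.clip(np.digitize(np.log1p(value), edges) - 1, 0, 9))
    return chr(ord("A") + bin_idx)

# Illustrative bicarbonate values; not taken from MIMIC-IV.
train = np.array([18.0, 22.0, 24.0, 26.0, 30.0])
print(encode_value(24.0, train))  # -> "F" for these made-up numbers
```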
### Token Construction
- **Format:** `<<lab_id_token>><<lab_value_bin>>`
- **Example:** For a lab value of type 'Bic' whose normalized value falls in bin 'C', the token is `BicC`.
- **Columns Used:** 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc' (one token per column, as shown below).
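
A token is therefore just the column abbreviation concatenated with the bin letter. A small illustrative helper (the bin letters below are made up):

```python
LAB_COLUMNS = ["Bic", "Crt", "Pot", "Sod", "Ure", "Hgb", "Plt", "Wbc"]

def row_to_tokens(bin_letters: dict) -> list:
    """Concatenate each lab id with its bin letter, e.g. 'Bic' + 'C' -> 'BicC'."""
    return [f"{col}{bin_letters[col]}" for col in LAB_COLUMNS]

print(row_to_tokens({"Bic": "C", "Crt": "A", "Pot": "D", "Sod": "F",
                     "Ure": "B", "Hgb": "E", "Plt": "G", "Wbc": "C"}))
# -> ['BicC', 'CrtA', 'PotD', 'SodF', 'UreB', 'HgbE', 'PltG', 'WbcC']
```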
### Training Data Preprocessing
- **Column Selection:** Numerical values from selected lab values.
- **Text Encoding:** Numeric values are encoded into text using the process described above.
- **Masking:** 20% of the tokens are randomly masked during training, as illustrated after this list.
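
In practice the 20% masking would typically happen at the tensor level inside the MLM collator (as in the training sketch above); the string-level version below is only meant to make the idea concrete.

```python
import numpy as np

def mask_tokens(tokens, rate=0.2, mask_token="[MASK]", seed=None):
    """Randomly replace `rate` of the tokens with a mask token (illustrative)."""
    rng = np.random.default_rng(seed)
    masked = list(tokens)
    n_mask = max(1, round(rate * len(tokens)))
    for i in rng.choice(len(tokens), size=n_mask, replace=False):
        masked[i] = mask_token
    return masked

print(mask_tokens(["BicC", "CrtA", "PotD", "SodF", "UreB"], seed=0))
# One of five tokens (20%) is replaced, e.g.
# ['BicC', 'CrtA', 'PotD', '[MASK]', 'UreB']
```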
### Model Output
- **Description:** During MLM training, the model predicts the identity of each masked token.
- **Format:** Predictions are encoded tokens (e.g., `BicC`) representing the predicted lab-value bins; an inference sketch follows.
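
Assuming the weights are published on the Hugging Face Hub (the model id below is a placeholder for this repository's path), inference could look like this fill-mask sketch:

```python
from transformers import pipeline

# Placeholder model id: substitute this repository's actual Hub path.
fill = pipeline("fill-mask", model="dsrestrepo/NumericBERT")

# Mask the 'Wbc' token of one encoded row and inspect the top predictions.
row = f"BicC CrtA PotD SodF UreB HgbE PltG {fill.tokenizer.mask_token}"
for pred in fill(row, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```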
### Limitations and Considerations
- **Numeric Data Representation:** Discretizing each value into one of ten bins discards fine-grained numeric information, so the custom text representation may not capture the intricacies of the original data.
- **Training Data Source:** Performance may be influenced by the characteristics and biases inherent in the MIMIC-IV dataset.
- **Generalizability:** The model's effectiveness outside the context of the training dataset is not guaranteed.
### Contact Information
- **Email:** davidres@mit.edu
- **Name:** David Restrepo
- **Affiliation:** MIT Critical Data, MIT