Compare different tokenizers at the character level and byte level.
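A minimal sketch of the distinction this comparison rests on (an illustrative example, not from the source): character-level tokenization yields one token per Unicode character, while byte-level tokenization yields one token per UTF-8 byte, so non-ASCII characters expand into multiple byte tokens.

```python
# Contrast character-level and byte-level tokenization of the same string.
text = "naïve café"

# Character-level: one token per Unicode character.
char_tokens = list(text)

# Byte-level: one token per UTF-8 byte; "ï" and "é" each take 2 bytes.
byte_tokens = list(text.encode("utf-8"))

print(len(char_tokens))  # 10 characters
print(len(byte_tokens))  # 12 bytes
```

The gap between the two counts grows with the amount of non-ASCII text, which is one reason byte-level vocabularies behave differently on multilingual corpora.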
BERT as a language model
Knowledge-injected pre-trained language models
Generating synthetic data via self-chatting