---
language: ms
---

# xlnet-large-bahasa-cased

Pretrained XLNet large language model for Malay.

## Pretraining Corpus

The `xlnet-large-bahasa-cased` model was pretrained on ~1.4 billion words. It was trained on the following data:

1. [cleaned local texts](https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean).
2. [translated The Pile](https://github.com/huseinzol05/malay-dataset/tree/master/corpus/pile).

## Pretraining details

- All steps can be reproduced from [Malaya/pretrained-model/xlnet](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet).

## Load Pretrained Model

You can use this model by installing `torch` or `tensorflow` together with the Hugging Face `transformers` library. You can then initialize it directly like this:

```python
from transformers import XLNetModel, XLNetTokenizer

# Load the pretrained weights from the Hugging Face Hub.
model = XLNetModel.from_pretrained('malay-huggingface/xlnet-large-bahasa-cased')

# The model is cased, so keep the original casing when tokenizing.
tokenizer = XLNetTokenizer.from_pretrained(
    'malay-huggingface/xlnet-large-bahasa-cased',
    do_lower_case = False,
)
```
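
Once the model and tokenizer are loaded, a typical next step is to encode a sentence and read the last-layer hidden states. The following is a minimal sketch rather than an official usage recipe. It assumes the PyTorch backend and that the `sentencepiece` package (required by `XLNetTokenizer`) is installed. The helper name `embed` and the constant `MODEL_NAME` are hypothetical and introduced here only for illustration.

```python
# Hub identifier taken from the loading snippet above.
MODEL_NAME = 'malay-huggingface/xlnet-large-bahasa-cased'


def embed(sentence, model_name=MODEL_NAME):
    """Return the last-layer hidden states for a single sentence.

    Hypothetical helper: it reloads the model on every call, which is
    simple but slow; cache the model and tokenizer for real workloads.
    """
    # Imported lazily so that defining this helper does not require
    # the heavy dependencies to be importable up front.
    import torch
    from transformers import XLNetModel, XLNetTokenizer

    tokenizer = XLNetTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = XLNetModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # Tokenize into PyTorch tensors (input_ids, attention_mask, ...).
    inputs = tokenizer(sentence, return_tensors='pt')

    # No gradients are needed for feature extraction.
    with torch.no_grad():
        outputs = model(**inputs)

    # Shape: (batch_size, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

Calling `embed('Saya suka makan nasi lemak.')` downloads the weights on first use and returns a tensor with one vector per token; the example sentence is arbitrary.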