First of all, thank you for making this. Second, how can I train my own model (and tokenizer) from a custom dataset?
I am having a hard time understanding Hugging Face's documentation. I have a text dataset that uses only around 200 different tokens. How would I go about training `MegaForCausalLM` on my dataset? Thank you for your help.
Hi @Tylersuard - I'd recommend checking out the Hugging Face course for both of those topics. Chapter 6 covers building a custom tokenizer, and Chapter 7 has a section on training a causal LM from scratch. There is also a Causal Language Modeling section in this Hugging Face example notebook; it fine-tunes from a pretrained checkpoint, but the mechanics should be the same, except that you'll be starting from a blank model and probably training for longer, with different hyperparameters, etc.
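For the tokenizer part, a minimal sketch of what Chapter 6 describes might look like this (here `raw_datasets` and the `"text"` column are placeholders for your own data, and the vocab size is just an example):

```python
from transformers import AutoTokenizer

# Placeholder: `raw_datasets` is a datasets.DatasetDict with a "text" column holding your corpus.
def text_iterator(batch_size=1000):
    for i in range(0, len(raw_datasets["train"]), batch_size):
        yield raw_datasets["train"][i : i + batch_size]["text"]

# Start from an existing fast tokenizer and retrain its vocabulary on your corpus.
old_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokenizer = old_tokenizer.train_new_from_iterator(text_iterator(), vocab_size=512)
tokenizer.save_pretrained("my-custom-tokenizer")
```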
As for MEGA specifically, you should be able to use whichever base tokenizer type you'd like, since there isn't really a MEGA tokenizer -- the one I used here is the same tokenizer used by RoBERTa. When you create your model class, you'll want to set `is_decoder=True` and `bidirectional=False` in the `MegaConfig` for compatibility with autoregressive language modeling.
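For example, something like the following (the model sizes here are placeholders -- pick whatever fits your data and compute budget):

```python
from transformers import AutoTokenizer, MegaConfig, MegaForCausalLM

# Any base tokenizer works; the pretrained MEGA checkpoint simply reuses RoBERTa's.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

config = MegaConfig(
    vocab_size=len(tokenizer),
    hidden_size=128,        # placeholder size
    num_hidden_layers=4,    # placeholder size
    is_decoder=True,        # per the note above: needed for autoregressive LM
    bidirectional=False,    # per the note above: EMA should only run left-to-right
)
model = MegaForCausalLM(config)  # randomly initialized, ready to train from scratch
```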
Thank you! I really appreciate your help. I am going through the tutorials now.
I am running into an error on line 901 of `modeling_mega.py`, where an additive mask is created from a causal mask.
I am using `max_positions=13008` and `chunk_size=16`, with chunking set to `True`.
Tensors pass through that function just fine until eventually one comes through that is only the size of a single chunk, while that code expects a full-sequence tensor of 13008 tokens.
I get this error in the softmax:
```
RuntimeError: The size of tensor a (16) must match the size of tensor b (13008) at non-singleton dimension 3
```
Do you know what is going on here? I went through my dataset and it appears that all samples are of uniform length.
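For reference, this is roughly how I'm building the model (everything apart from the values mentioned above is a placeholder):

```python
from transformers import MegaConfig, MegaForCausalLM

config = MegaConfig(
    vocab_size=256,        # placeholder -- my data only uses ~200 distinct tokens
    max_positions=13008,
    use_chunking=True,
    chunk_size=16,
    is_decoder=True,
    bidirectional=False,
)
model = MegaForCausalLM(config)
```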
This is what my dataset looks like:
```
DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2403
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2423
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 19021
    })
})
```
Thank you for your help.
I'm honestly not sure what's causing this, and it's hard to guess based on the error alone. If all of your samples are of uniform length, then I'm not sure how one batch would have a different number of chunks than the others. The only thing I can think of would be making sure that `pad_to_multiple_of` in your tokenizer is set to the `chunk_size`.
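For example, something along these lines (the tokenize function and column name here are just placeholders for however you're preprocessing your data):

```python
# Placeholder tokenization step -- the important bit is pad_to_multiple_of matching chunk_size.
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=13008,
        padding=True,            # pad each batch to its longest sequence...
        pad_to_multiple_of=16,   # ...rounded up to a multiple of chunk_size
    )

tokenized = raw_datasets.map(tokenize_fn, batched=True, remove_columns=["text"])
```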
This would probably also be a good candidate for a GitHub issue with a minimal example to reproduce the behavior.