First of all, thank you for making this. Second, how can I train my own model (and tokenizer) from a custom dataset?
I am having a hard time understanding Hugging Face's documentation. I have a text dataset that uses only around 200 different tokens. How would I go about training `MegaForCausalLM` on my dataset? Thank you for your help.
Hi @Tylersuard - I'd recommend checking out the Hugging Face course for both of those topics. Chapter 6 covers building a custom tokenizer, and Chapter 7 has a section on training a causal LM from scratch. There is also a Causal Language Modeling section in this Hugging Face example notebook; it fine-tunes from a pretrained checkpoint, but the mechanics should be the same, except that you'll be starting from a blank model and probably training for longer, with different hyperparameters, etc.
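For the tokenizer part, a minimal sketch of what Chapter 6 describes might look like this (here `raw_datasets` and the `"text"` column are placeholders for your own data, and the vocab size is just an example):

```python
from transformers import AutoTokenizer

# Placeholder: `raw_datasets` is a datasets.DatasetDict with a "text" column holding your corpus.
def text_iterator(batch_size=1000):
    for i in range(0, len(raw_datasets["train"]), batch_size):
        yield raw_datasets["train"][i : i + batch_size]["text"]

# Start from an existing fast tokenizer and retrain its vocabulary on your corpus.
old_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokenizer = old_tokenizer.train_new_from_iterator(text_iterator(), vocab_size=512)
tokenizer.save_pretrained("my-custom-tokenizer")
```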
As for MEGA specifically, you should be able to use whichever base tokenizer type you'd like, since there isn't really a MEGA tokenizer -- the one I used here is the same tokenizer used by RoBERTa. When you create your model class, you'll want to set `is_decoder=True` and `bidirectional=False` in the `MegaConfig` for compatibility with autoregressive language modeling.
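For example, something like the following (the model sizes here are placeholders -- pick whatever fits your data and compute budget):

```python
from transformers import AutoTokenizer, MegaConfig, MegaForCausalLM

# Any base tokenizer works; the pretrained MEGA checkpoint simply reuses RoBERTa's.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

config = MegaConfig(
    vocab_size=len(tokenizer),
    hidden_size=128,        # placeholder size
    num_hidden_layers=4,    # placeholder size
    is_decoder=True,        # per the note above: needed for autoregressive LM
    bidirectional=False,    # per the note above: EMA should only run left-to-right
)
model = MegaForCausalLM(config)  # randomly initialized, ready to train from scratch
```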
Thank you! I really appreciate your help. I am going through the tutorials now.
I am running into an error on line 901 of `modeling_mega.py`, where an additive mask is created from a causal mask.
I am using `max_positions=13008` and `chunk_size=16`, with chunking set to `True`.
Tensors pass through that function just fine until eventually one comes through that is only the size of a single chunk, while that code expects a full-sequence tensor of 13008 tokens.
I get this error in the softmax:
```
RuntimeError: The size of tensor a (16) must match the size of tensor b (13008) at non-singleton dimension 3
```
Do you know what is going on here? I went through my dataset and it appears that all samples are of uniform length.
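For reference, this is roughly how I'm building the model (everything apart from the values mentioned above is a placeholder):

```python
from transformers import MegaConfig, MegaForCausalLM

config = MegaConfig(
    vocab_size=256,        # placeholder -- my data only uses ~200 distinct tokens
    max_positions=13008,
    use_chunking=True,
    chunk_size=16,
    is_decoder=True,
    bidirectional=False,
)
model = MegaForCausalLM(config)
```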
This is what my dataset looks like:
```
DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2403
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2423
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 19021
    })
})
```
Thank you for your help.
I'm honestly not sure what's causing this, and it's hard to guess based on the error alone. If all of your samples are of uniform length, then I'm not sure how one batch would have a different number of chunks than the others. The only thing I can think of would be making sure that `pad_to_multiple_of` in your tokenizer is set to the `chunk_size`.
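For example, something along these lines (the tokenize function and column name here are just placeholders for however you're preprocessing your data):

```python
# Placeholder tokenization step -- the important bit is pad_to_multiple_of matching chunk_size.
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=13008,
        padding=True,            # pad each batch to its longest sequence...
        pad_to_multiple_of=16,   # ...rounded up to a multiple of chunk_size
    )

tokenized = raw_datasets.map(tokenize_fn, batched=True, remove_columns=["text"])
```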
This would probably also be a good candidate for a GitHub issue with a minimal example to reproduce the behavior.