Train a Llama model from scratch

This script is deprecated! Many updates to transformers have happened since its release!

In this tutorial, we'll walk through the process of training a language model using the Llama model architecture and the Transformers library.

1. Installing the Required Libraries

We'll start by installing the necessary libraries using pip:

!pip install -q datasets accelerate evaluate trl transformers jinja2

2. Logging into Hugging Face Hub

Next, we'll log into the Hugging Face Hub to access the required models and datasets:

from huggingface_hub import notebook_login

notebook_login()

3. Loading the Dataset and Defining the Tokenizer and Model

We'll load our training dataset, train a tokenizer on it, and define the Llama model from scratch.

This part is pretty complicated, so stay with me.

from datasets import load_dataset

dataset = load_dataset("your_dataset_name", split="train") # load the dataset

Here, we'll build a generator over the corpus to pass to the tokenizer trainer:

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

training_corpus = get_training_corpus()

The base tokenizer is up to you. I'm training a blank one from scratch here, but a lot of people opt to start from an existing one, such as GPT-2's (see the alternative sketch after the next block).

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    training_corpus,
    vocab_size=3200,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "<|user|>", "<|bot|>", "<|end|>"] # you can pick the last two or three, as you'll see next
)
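If you'd rather start from an existing tokenizer instead of a blank one, transformers can retrain it on your corpus while keeping its configuration. A minimal sketch, assuming you want GPT-2's byte-level BPE as the base (if you go this route, you already have a transformers tokenizer and can skip the wrapping step in the next block):

from transformers import AutoTokenizer

# alternative: retrain GPT-2's tokenizer on your own corpus
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = base_tokenizer.train_new_from_iterator(
    get_training_corpus(), # fresh generator; the one above is already consumed
    vocab_size=3200,
    new_special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "<|user|>", "<|bot|>", "<|end|>"]
)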

Next, we'll wrap the trained tokenizer in a transformers-compatible fast tokenizer, then define its special tokens and chat template.

from transformers import PreTrainedTokenizerFast

# wrap the underlying tokenizers object so we get the full transformers API
# (dict-based add_special_tokens, chat_template, apply_chat_template, ...)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer._tokenizer)

special_tokens = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<|user|>", "<|bot|>", "<|end|>"] # same here
}
tokenizer.add_special_tokens(special_tokens)

tokenizer.user_token_id = tokenizer.convert_tokens_to_ids("<|user|>") # here
tokenizer.assistant_token_id = tokenizer.convert_tokens_to_ids("<|bot|>") # too

chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + '<|end|>\n' }}{% elif message['role'] == 'assistant' %}{{ '<|bot|>\n' + message['content'] + '<|end|>\n' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}{{ eos_token }}" # this is where you define the chat template, so you can go crazy here. Something a lot of people do for whatever reason is add seemingly random newline characters

tokenizer.chat_template = chat_template

Let's quickly test that the chat template works:

print(tokenizer.apply_chat_template([{"role": "user", "content": "Why is the sky blue?"}, {"role": "assistant", "content": "Due to Rayleigh scattering."}], tokenize=False)) # test to see if the chat template worked

Now, finally, we'll define the model.

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rms_norm_eps=1e-6,
    initializer_range=0.02,
    use_cache=True,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    tie_word_embeddings=False,
)

model = LlamaForCausalLM(config)
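Before moving on, it's worth sanity-checking how big the model you just defined is:

# quick check: parameter count of the freshly initialized model
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")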

4. Formatting the Dataset

We'll define a function that formats each example into a single 'text' field, then map it over the dataset:

def format_prompts(examples):
    """
    Define the format for your dataset
    This function should return a dictionary with a 'text' key containing the formatted prompts.
    """
    pass
dataset = dataset.map(format_prompts, batched=True)
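If it helps, here is one minimal way format_prompts could look. This is only a sketch: it assumes a hypothetical "conversations" column holding lists of {"role": ..., "content": ...} dicts, which your dataset may not have.

def format_prompts(examples):
    # hypothetical: each row has a "conversations" list of role/content dicts
    texts = [
        tokenizer.apply_chat_template(conversation, tokenize=False)
        for conversation in examples["conversations"]
    ]
    return {"text": texts}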

print(dataset['text'][2]) # Check to see if the fields were formatted correctly

5. Setting Up the Training Arguments

Define the training args:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="your_output_dir",
    num_train_epochs=4, # replace this, depending on your dataset
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    optim="sgd" # sgd, my beloved
)
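One note on the optimizer: if you drop the optim line, TrainingArguments defaults to AdamW, which is the usual choice for training transformers; SGD works, but it generally needs more careful learning-rate tuning.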

6. Creating the Trainer

We'll create an instance of the SFTTrainer from the trl library:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset,
    dataset_text_field='text',
    max_seq_length=512
)
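Since this script predates several trl releases, be aware that newer versions moved dataset_text_field and max_seq_length out of SFTTrainer and into an SFTConfig object that replaces TrainingArguments. A rough sketch under that assumption; argument names have kept shifting between releases, so check the docs for the trl version you have installed:

from trl import SFTConfig, SFTTrainer

# rough sketch for newer trl releases; exact argument names vary by version
sft_args = SFTConfig(
    output_dir="your_output_dir",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    dataset_text_field="text",
    max_seq_length=512,
)

trainer = SFTTrainer(
    model=model,
    args=sft_args,
    train_dataset=dataset,
    tokenizer=tokenizer, # renamed to processing_class in the most recent releases
)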

7. Training the Model

Finally, we'll start the training process:

trainer.train()

8. Pushing the Trained Model to Hugging Face Hub

After the training is complete, you can push the trained model to the Hugging Face Hub using the following command:

trainer.push_to_hub()

This will upload the model to your Hugging Face Hub account, making it available for future use or sharing.
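To quickly try the uploaded model, you can load it back with the Auto classes. The repo id below is a hypothetical placeholder, and the prompt string simply mirrors the chat template we defined earlier:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your_username/your_output_dir" # hypothetical; replace with your own repo id
loaded_tokenizer = AutoTokenizer.from_pretrained(repo_id)
loaded_model = AutoModelForCausalLM.from_pretrained(repo_id)

# build a prompt in the same format as the chat template defined above
prompt = "<s><|user|>\nWhy is the sky blue?<|end|>\n<|bot|>\n"
inputs = loaded_tokenizer(prompt, return_tensors="pt")
outputs = loaded_model.generate(**inputs, max_new_tokens=64)
print(loaded_tokenizer.decode(outputs[0]))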

That's it!