metadata

language:
  - en
tags:
  - llama
  - llm
  - fine-tuning
  - fill-in-the-middle
  - instruction-following
license: apache-2.0
datasets:
  - mlabonne/FineTome-100k
  - mlfoundations/dclm-baseline-1.0-parquet
  - wikimedia/wikipedia
  - bigcode/starcoderdata
pipeline_tag: text-generation

Custom LLM with Full Fine-Tuning

Model Overview

This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version, which used a high-rank adapter, this model uses full fine-tuning, so every parameter is updated, for greater learning capacity across a variety of tasks.

Developer: Eric Florenzano
Model Type: Large Language Model (LLM)
Language(s): English, with a focus on Python for code-related tasks
License: Apache-2.0
Base Model: meta-llama/Meta-Llama-3.1-8B

Unique Training Approach

This model is trained directly on a mixture of high-quality datasets for general text and code completion tasks, as well as instruction-following. Key features include:

Full Fine-Tuning: Unlike the previous LoRA approach, this version uses full fine-tuning to update all model parameters.
Diverse Dataset Mixture: Combines pretraining and instruction datasets for comprehensive language understanding.
Multi-Format Instruction Tuning: Alternates between ChatML and Llama Chat templates for flexible instruction-following.
Contextual Data Prefixing: Uses source information to address data imbalance during training.
Fill-in-the-Middle (FIM) Training: Incorporates FIM tasks for enhanced context understanding.
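
As a rough illustration of the multi-format instruction tuning listed above, the sketch below renders one conversation in ChatML and in a Llama 3 style header layout, switching by example index. The function names, the even/odd alternation rule, and the choice of the Llama 3 header layout are all assumptions made for illustration; this card does not state which Llama template or which schedule the training script uses.

```python
from typing import Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}


def render_chatml(messages: List[Message]) -> str:
    """Render messages in the ChatML layout."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def render_llama3(messages: List[Message]) -> str:
    """Render messages in a Llama 3 style header layout (assumed variant)."""
    body = "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    )
    return "<|begin_of_text|>" + body


def render_alternating(index: int, messages: List[Message]) -> str:
    """Hypothetical schedule: even examples use ChatML, odd ones the Llama layout."""
    return render_chatml(messages) if index % 2 == 0 else render_llama3(messages)


conversation = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
chatml_text = render_alternating(0, conversation)
llama_text = render_alternating(1, conversation)
```

Seeing the same content under both layouts is what would let a model answer prompts written in either format.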

Training Data

The model is trained on a blend of high-quality data sources:

FineTome-100k: High-quality instruction-tuned data for general language tasks.
dclm-baseline-1.0-parquet: The DataComp-LM (DCLM) baseline pretraining corpus, published under mlfoundations, for text completion/prediction.
English, Spanish, and French Wikipedia: For broad language understanding.
starcoderdata: BigCode's code dataset (bigcode/starcoderdata), used here with a focus on Python for code completion tasks.

Training Procedure

Setup

pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn

Key Features

Full Fine-Tuning: Updates all model parameters for comprehensive learning.
8-bit AdamW Optimizer: Uses adamw_bnb_8bit for memory-efficient training.
Flash Attention 2: Implements flash_attention_2 for faster training.
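Gradient Checkpointing: Recomputes activations during the backward pass, trading extra compute for lower GPU memory use.
Liger and Packing: Sets use_liger=true to enable Liger kernels and packing=true to pack multiple samples into each training sequence, reducing memory use and padding waste.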
BFloat16 Precision: Uses bfloat16 for balanced precision and performance.

Advanced Training Techniques

This model incorporates several advanced training techniques to enhance its capabilities:

1. Fill-in-the-Middle (FIM) Capability

FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.

Using FIM with the Model

To use the FIM capability, structure your input with special tokens:

<|fim_start|>: Marks the start of the FIM input
<|fim_marker|>: Separates the prefix from the suffix
<|fim_gen|>: Indicates where the generated content should begin
<|fim_end|>: Marks the end of the FIM input

Example FIM input:

<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>

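The model generates the missing middle after <|fim_gen|>, that is, the text that belongs between the prefix and the suffix. In the Meta-FIM example later in this card, the generated span is closed by <|fim_end|>.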
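
The input layout above can be assembled with plain string handling. The following is a minimal sketch: the helper names are hypothetical, the completion is a hard-coded stand-in rather than real model output, and treating <|fim_end|> as the stop marker is an inference from the Meta-FIM example in this card rather than documented behavior.

```python
FIM_START = "<|fim_start|>"
FIM_MARKER = "<|fim_marker|>"
FIM_GEN = "<|fim_gen|>"
FIM_END = "<|fim_end|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble the FIM input: start, prefix, marker, suffix, gen."""
    return f"{FIM_START}{prefix}{FIM_MARKER}{suffix}{FIM_GEN}"


def extract_middle(completion: str) -> str:
    """Keep only the text generated before the first <|fim_end|>, if present."""
    return completion.split(FIM_END, 1)[0]


def fill_in(prefix: str, suffix: str, completion: str) -> str:
    """Splice the generated middle between the original prefix and suffix."""
    return prefix + extract_middle(completion) + suffix


prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(1, 2))\n"
prompt = build_fim_prompt(prefix, suffix)

# Stand-in for the model's continuation of `prompt`; a real run would call the model.
fake_completion = "return a + b<|fim_end|>"
filled = fill_in(prefix, suffix, fake_completion)
```

In practice the string in `prompt` would be tokenized and passed to the model, and the decoded continuation would take the place of `fake_completion`.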

2. Reverse Prediction and Instruction Backtranslation

This technique enhances the model's context understanding by training it to predict previous parts of a conversation or text. It's also known as instruction backtranslation.

How it works:

The model is given a snippet of conversation or text.
It's then tasked with predicting what came before this snippet.
This process helps the model understand context, conversation flow, and logical progression of ideas.

Benefits:

Improved context understanding
Enhanced ability to maintain coherent, contextually appropriate conversations
Better grasp of cause-and-effect relationships in text

Example use case:

Input:

Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?

Task: Predict the previous exchange in this conversation.

Possible model output:

Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
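
One way such training pairs could be built from an existing conversation is sketched below. The function name, the split rule, and the exact wording of the task line are assumptions for illustration; the card does not describe the actual data pipeline. The target here holds only the earlier turns, whereas the example output above also repeats the snippet.

```python
from typing import List, Tuple

TASK_LINE = "Task: Predict the previous exchange in this conversation."


def make_reverse_example(turns: List[str], split_at: int) -> Tuple[str, str]:
    """Build (model_input, target) for reverse prediction.

    turns[split_at:] is the snippet shown to the model;
    turns[:split_at] is the earlier context it must predict.
    """
    if not 0 < split_at < len(turns):
        raise ValueError("split_at must leave text on both sides")
    snippet = "\n".join(turns[split_at:])
    earlier = "\n".join(turns[:split_at])
    return f"{snippet}\n\n{TASK_LINE}", earlier


turns = [
    "Human: What's the capital of France?",
    "Assistant: The capital of France is Paris.",
    "Human: Thank you for the information about Paris. "
    "Can you recommend some popular tourist attractions there?",
]
model_input, target = make_reverse_example(turns, split_at=2)
```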

3. Meta-FIM

Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.

Benefits:

Enhanced understanding of long-range dependencies in text
Improved ability to maintain coherence across longer contexts
Better performance on tasks requiring integration of information from multiple parts of a document or conversation

Example:

<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>

In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.
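
A Meta-FIM sample like the one above could be assembled by hiding a run of turns in the middle of a conversation. This is a sketch under assumptions: the function name and the head/tail parameters are hypothetical, and the real script may select spans differently. Only the token order is taken from the example.

```python
from typing import List


def make_meta_fim_example(turns: List[str], head: int, tail: int) -> str:
    """Hide the turns between the first `head` and last `tail` turns.

    The visible head becomes the prefix, the visible tail becomes the suffix,
    and the hidden turns follow <|fim_gen|> as the text to be generated.
    """
    if head < 1 or tail < 1 or head + tail >= len(turns):
        raise ValueError("need at least one turn in the head, middle, and tail")
    prefix = "\n".join(turns[:head])
    middle = "\n".join(turns[head:len(turns) - tail])
    suffix = "\n".join(turns[len(turns) - tail:])
    return f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>{middle}<|fim_end|>"


turns = [
    "Human: What's the weather like today?",
    "Assistant: Could you please provide your location?",
    "Human: I'm in Paris, France.",
    "Assistant: Paris generally has a temperate climate.",
    "Human: Can you tell me about some famous landmarks?",
]
example = make_meta_fim_example(turns, head=2, tail=1)
```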

Evaluation

|Tasks            |Version|Filter          |n-shot|Metric     |   |Value |   |Stderr|
|-----------------|-------|----------------|------|-----------|---|------|---|------|
|tinyBenchmarks   |N/A    |                |      |           |   |      |   |      |
| - tinyArc       |0      |none            |25    |acc_norm   |↑  |0.5821|±  |N/A   |
| - tinyGSM8k     |0      |flexible-extract|5     |exact_match|↑  |0.4989|±  |N/A   |
|                 |       |strict-match    |5     |exact_match|↑  |0.4867|±  |N/A   |
| - tinyHellaswag |0      |none            |10    |acc_norm   |↑  |0.8307|±  |N/A   |
| - tinyMMLU      |0      |none            |0     |acc_norm   |↑  |0.6651|±  |N/A   |
| - tinyTruthfulQA|0      |none            |0     |acc        |↑  |0.4991|±  |N/A   |
| - tinyWinogrande|0      |none            |5     |acc_norm   |↑  |0.7558|±  |N/A   |

Training Command

python sft_14.py \
    --run_name="llama3.1-8b-continued3" \
    --model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
    --dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
    --report_to="wandb" \
    --optim="adamw_bnb_8bit" \
    --lr_scheduler_type="cosine" \
    --max_steps=100000 \
    --max_seq_length=64000 \
    --learning_rate=0.00001 \
    --attn_implementation="flash_attention_2" \
    --save_strategy="steps" \
    --save_steps=50 \
    --save_total_limit=10 \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --logging_steps=1 \
    --num_train_epochs=1 \
    --push_to_hub \
    --hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
    --hub_strategy="all_checkpoints" \
    --gradient_checkpointing \
    --use_liger=true \
    --packing=true \
    --torch_dtype="bfloat16" \
    --output_dir="continuedtraining3_output"
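
The batch-related flags in the command above imply an upper bound on how many tokens each optimizer step can see. The short calculation below works this out. The number of GPUs is not stated in this card, so it is left as a parameter, and the function names are hypothetical. With packing enabled each sequence holds at most max_seq_length tokens, so these are ceilings, not measured throughput.

```python
# Values copied from the training command above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
max_seq_length = 64_000
max_steps = 100_000


def sequences_per_step(num_devices: int) -> int:
    """Packed sequences consumed per optimizer step across all devices."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def max_tokens_per_step(num_devices: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence fully packed)."""
    return sequences_per_step(num_devices) * max_seq_length


def max_tokens_total(num_devices: int) -> int:
    """Upper bound on tokens seen if training runs for all max_steps."""
    return max_tokens_per_step(num_devices) * max_steps


# num_devices=1 is an assumption; substitute the real device count.
single_gpu_tokens_per_step = max_tokens_per_step(1)
```

On a single device this comes to at most 8 × 64,000 = 512,000 tokens per optimizer step. If sft_14.py forwards these flags to Hugging Face TrainingArguments, which this card does not confirm, a positive max_steps takes precedence over num_train_epochs.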

Intended Uses

This model is designed for:

Text Completion and Generation
Code Completion (especially Python)
Instruction Following
General Language Understanding
Context-Aware Text Infilling (using FIM)

Limitations and Biases

The model may exhibit biases present in the training data.
It lacks real-time knowledge beyond its training data.
It should not be used for critical decision-making without human oversight.

Technical Specifications

Base Model: meta-llama/Meta-Llama-3.1-8B
Training Approach: Full Fine-Tuning
Libraries: Hugging Face Transformers and TRL

Contact

For inquiries about this model, please contact Eric Florenzano through the model repository.