language:
- en
tags:
- llama
- llm
- fine-tuning
- fill-in-the-middle
- instruction-following
license: apache-2.0
datasets:
- mlabonne/FineTome-100k
- mlfoundations/dclm-baseline-1.0-parquet
- wikimedia/wikipedia
- bigcode/starcoderdata
pipeline_tag: text-generation
Custom LLM with Full Fine-Tuning
Model Overview
This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version, which used a high-rank adapter, this model employs full fine-tuning for greater learning capacity across a variety of tasks.
- Developer: Eric Florenzano
- Model Type: Large Language Model (LLM)
- Language(s): English, with a focus on Python for code-related tasks
- License: Apache-2.0
- Base Model: meta-llama/Meta-Llama-3.1-8B
Unique Training Approach
This model is trained directly on a mixture of high-quality datasets for general text and code completion tasks, as well as instruction-following. Key features include:
- Full Fine-Tuning: Unlike the previous LoRA approach, this version uses full fine-tuning to update all model parameters.
- Diverse Dataset Mixture: Combines pretraining and instruction datasets for comprehensive language understanding.
- Multi-Format Instruction Tuning: Alternates between ChatML and Llama Chat templates for flexible instruction-following.
- Contextual Data Prefixing: Prefixes training samples with their data source to help the model cope with imbalance in the dataset mixture.
- Fill-in-the-Middle (FIM) Training: Incorporates FIM tasks for enhanced context understanding.
Training Data
The model is trained on a blend of high-quality data sources:
- FineTome-100k: High-quality instruction-tuned data for general language tasks.
- dclm-baseline-1.0-parquet: The DCLM-Baseline pretraining corpus from Apple and collaborators (DataComp-LM project), used for text completion/prediction.
- English, Spanish, and French Wikipedia: For broad language understanding.
- Starcoder: High-quality Python-focused code dataset for code completion tasks.
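As a rough illustration of how this blend and the contextual data prefixing described above could be put together, here is a sketch using the datasets library. The prefix format, column handling, and mixture weights are assumptions for illustration, and the instruction data (FineTome) would additionally need chat-template formatting before mixing:

```python
# Sketch only: dataset names come from this card; the prefix format, column
# handling, and mixture weights below are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

def add_source_prefix(example, source, text_key):
    # Prepend the originating corpus so the model can condition on provenance.
    return {"text": f"[source: {source}]\n{example[text_key]}"}

dclm = (
    load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True)
    .map(lambda ex: add_source_prefix(ex, "dclm", "text"))
    .select_columns(["text"])
)
wiki = (
    load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
    .map(lambda ex: add_source_prefix(ex, "wikipedia", "text"))
    .select_columns(["text"])
)
code = (
    load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True)
    .map(lambda ex: add_source_prefix(ex, "starcoder", "content"))
    .select_columns(["text"])
)

# Weighted interleaving keeps the smaller corpora from being drowned out.
mixture = interleave_datasets([dclm, wiki, code], probabilities=[0.5, 0.25, 0.25], seed=42)
```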
Training Procedure
Setup
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn
Key Features
- Full Fine-Tuning: Updates all model parameters for comprehensive learning.
- 8-bit AdamW Optimizer: Uses adamw_bnb_8bit for memory-efficient training.
- Flash Attention 2: Implements flash_attention_2 for faster training.
- Gradient Checkpointing: Enables training with limited GPU memory.
- Liger and Packing: Utilizes use_liger=true and packing=true for efficient data handling.
- BFloat16 Precision: Uses bfloat16 for balanced precision and performance.
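The key features above map onto TRL's SFTConfig/SFTTrainer roughly as in the sketch below. Values mirror the training command later in this card; the dataset wiring is a placeholder, and some field names (e.g. use_liger, max_seq_length) follow the TRL version implied by that command and may differ in newer releases:

```python
# Condensed sketch, not the exact training script (sft_14.py is not shown here).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset(
    "mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True
)  # placeholder; the real run mixes several corpora

config = SFTConfig(
    output_dir="continuedtraining3_output",
    optim="adamw_bnb_8bit",              # 8-bit AdamW optimizer
    bf16=True,                           # BFloat16 precision
    gradient_checkpointing=True,         # fit full fine-tuning in limited GPU memory
    use_liger=True,                      # Liger kernels
    packing=True,                        # pack multiple short samples per sequence
    max_seq_length=64000,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    max_steps=100_000,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    model_init_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3.1-8B",  # no PEFT config: all parameters are updated
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```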
Advanced Training Techniques
This model incorporates several advanced training techniques to enhance its capabilities:
1. Fill-in-the-Middle (FIM) Capability
FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.
Using FIM with the Model
To use the FIM capability, structure your input with special tokens:
- <|fim_start|>: Marks the start of the FIM input
- <|fim_marker|>: Separates the prefix from the suffix
- <|fim_gen|>: Indicates where the generated content should begin
- <|fim_end|>: Marks the end of the FIM input
Example FIM input:
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>
The model will generate content to replace <|fim_gen|>, filling in the middle between the prefix and suffix.
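A minimal generation sketch using this format is shown below. It assumes the published checkpoint's tokenizer includes the FIM tokens above; the prompt content and decoding settings are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ericflo/Llama-3.1-8B-ContinuedTraining3-FFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prefix = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
suffix = "\n    return fibonacci(n - 1) + fibonacci(n - 2)\n"

# Prefix and suffix are given; the model fills in the middle after <|fim_gen|>.
prompt = f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated "middle" tokens.
middle = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(middle)
```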
2. Reverse Prediction and Instruction Backtranslation
This technique enhances the model's context understanding by training it to predict previous parts of a conversation or text. It's also known as instruction backtranslation.
How it works:
- The model is given a snippet of conversation or text.
- It's then tasked with predicting what came before this snippet.
- This process helps the model understand context, conversation flow, and logical progression of ideas.
Benefits:
- Improved context understanding
- Enhanced ability to maintain coherent, contextually appropriate conversations
- Better grasp of cause-and-effect relationships in text
Example use case:
Input:
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
Task: Predict the previous exchange in this conversation.
Possible model output:
Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
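A sketch of how such reverse-prediction pairs could be constructed from an ordinary conversation is shown below. The task wording and prompt layout are illustrative assumptions, not the exact templates used in training:

```python
def make_reverse_prediction_example(turns, visible_turns=1):
    """Build a 'predict what came before' pair from (speaker, text) turns."""
    shown = turns[-visible_turns:]   # the snippet the model is shown
    hidden = turns[:-visible_turns]  # the earlier context it must reconstruct
    prompt = (
        "\n".join(f"{speaker}: {text}" for speaker, text in shown)
        + "\n\nTask: Predict the previous exchange in this conversation.\n"
    )
    completion = "\n".join(f"{speaker}: {text}" for speaker, text in hidden)
    return {"prompt": prompt, "completion": completion}

example = make_reverse_prediction_example([
    ("Human", "What's the capital of France?"),
    ("Assistant", 'The capital of France is Paris, known as the "City of Light".'),
    ("Human", "Thank you for the information about Paris. Can you recommend some popular tourist attractions there?"),
])
print(example["prompt"])
print(example["completion"])
```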
3. Meta-FIM
Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.
Benefits:
- Enhanced understanding of long-range dependencies in text
- Improved ability to maintain coherence across longer contexts
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation
Example:
<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>
In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.
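A sketch of how a Meta-FIM training example might be produced by cutting a whole conversation at two points and moving the middle span after <|fim_gen|> follows. The character-level cutting heuristic is an assumption; real preprocessing would more likely cut on turn or token boundaries:

```python
import random

FIM_START, FIM_MARKER, FIM_GEN, FIM_END = (
    "<|fim_start|>", "<|fim_marker|>", "<|fim_gen|>", "<|fim_end|>",
)

def make_meta_fim_example(document: str, rng: random.Random) -> str:
    # Pick two cut points; the span between them becomes the "middle" that the
    # model must regenerate from the surrounding prefix and suffix.
    i, j = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_START}{prefix}{FIM_MARKER}{suffix}{FIM_GEN}{middle}{FIM_END}"

conversation = (
    "Human: What's the weather like today?\n"
    "Assistant: I don't have real-time weather data. Could you share your location?\n"
    "Human: I'm in Paris, France.\n"
    "Assistant: Paris has a temperate climate; a weather app will have today's details.\n"
)
print(make_meta_fim_example(conversation, random.Random(0)))
```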
Evaluation
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tinyBenchmarks | N/A | | | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.5821 | ± | N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.4989 | ± | N/A |
| | | strict-match | 5 | exact_match | ↑ | 0.4867 | ± | N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.8307 | ± | N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.6651 | ± | N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.4991 | ± | N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.7558 | ± | N/A |
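The table layout matches lm-evaluation-harness output for the tinyBenchmarks task group. A comparable run could be launched through the harness's Python API as sketched below; the exact harness version and settings behind the numbers above are not recorded here:

```python
import lm_eval  # pip install lm-eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ericflo/Llama-3.1-8B-ContinuedTraining3-FFT,dtype=bfloat16",
    tasks=["tinyBenchmarks"],  # group covering tinyArc, tinyGSM8k, tinyHellaswag, ...
    batch_size=1,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```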
Training Command
python sft_14.py \
--run_name="llama3.1-8b-continued3" \
--model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
--dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
--report_to="wandb" \
--optim="adamw_bnb_8bit" \
--lr_scheduler_type="cosine" \
--max_steps=100000 \
--max_seq_length=64000 \
--learning_rate=0.00001 \
--attn_implementation="flash_attention_2" \
--save_strategy="steps" \
--save_steps 50 \
--save_total_limit=10 \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=8 \
--logging_steps=1 \
--num_train_epochs=1 \
--push_to_hub \
--hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
--hub_strategy="all_checkpoints" \
--gradient_checkpointing \
--use_liger=true \
--packing=true \
--torch_dtype="bfloat16" \
--output_dir="continuedtraining3_output"
Intended Uses
This model is designed for:
- Text Completion and Generation
- Code Completion (especially Python)
- Instruction Following
- General Language Understanding
- Context-Aware Text Infilling (using FIM)
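A basic completion example with the transformers pipeline (repository id taken from the training command above; the prompt and generation settings are placeholders):

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Plain text/code completion; for infilling, use the FIM format described above.
print(generator("def quicksort(items):", max_new_tokens=128)[0]["generated_text"])
```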
Limitations and Biases
- The model may exhibit biases present in the training data.
- It lacks real-time knowledge beyond its training data.
- It should not be used for critical decision-making without human oversight.
Technical Specifications
- Base Model: meta-llama/Meta-Llama-3.1-8B
- Training Approach: Full Fine-Tuning
- Library: Hugging Face Transformers and TRL
Contact
For inquiries about this model, please contact Eric Florenzano through the model repository.