|
--- |
|
language: |
|
- en |
|
tags: |
|
- llama |
|
- llm |
|
- fine-tuning |
|
- fill-in-the-middle |
|
- instruction-following |
|
license: apache-2.0 |
|
datasets: |
|
- mlabonne/FineTome-100k |
|
- mlfoundations/dclm-baseline-1.0-parquet |
|
- wikimedia/wikipedia |
|
- bigcode/starcoderdata |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# Custom LLM with Full Fine-Tuning |
|
|
|
## Model Overview |
|
|
|
This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version which used a high-rank adapter, this model employs full fine-tuning for enhanced learning capacity across a variety of tasks. |
|
|
|
- **Developer:** Eric Florenzano |
|
- **Model Type:** Large Language Model (LLM) |
|
- **Language(s):** English, with a focus on Python for code-related tasks |
|
- **License:** Apache-2.0 |
|
- **Base Model:** meta-llama/Meta-Llama-3.1-8B |
|
|
|
## Unique Training Approach |
|
|
|
This model is trained directly on a mixture of high-quality datasets for general text and code completion tasks, as well as instruction-following. Key features include: |
|
|
|
- **Full Fine-Tuning:** Unlike the previous LoRA approach, this version uses full fine-tuning to update all model parameters. |
|
- **Diverse Dataset Mixture:** Combines pretraining and instruction datasets for comprehensive language understanding. |
|
- **Multi-Format Instruction Tuning:** Alternates between ChatML and Llama Chat templates for flexible instruction-following. |
|
- **Contextual Data Prefixing:** Uses source information to address data imbalance during training. |
|
- **Fill-in-the-Middle (FIM) Training:** Incorporates FIM tasks for enhanced context understanding. |
|
|
|
## Training Data |
|
|
|
The model is trained on a blend of high-quality data sources: |
|
|
|
- **FineTome-100k:** High-quality instruction-tuned data for general language tasks. |
|
- **dclm-baseline-1.0-parquet:** Apple's pretraining corpus for text completion/prediction. |
|
- **English, Spanish, and French Wikipedia:** For broad language understanding. |
|
- **Starcoder:** High-quality Python-focused code dataset for code completion tasks. |
|
|
|
## Training Procedure |
|
|
|
### Setup |
|
|
|
```bash |
|
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn |
|
``` |
|
|
|
## Key Features |
|
|
|
1. **Full Fine-Tuning:** Updates all model parameters for comprehensive learning. |
|
2. **8-bit AdamW Optimizer:** Uses `adamw_bnb_8bit` for memory-efficient training. |
|
3. **Flash Attention 2:** Implements `flash_attention_2` for faster training. |
|
4. **Gradient Checkpointing:** Enables training with limited GPU memory. |
|
5. **Liger and Packing:** Utilizes `use_liger=true` and `packing=true` for efficient data handling. |
|
6. **BFloat16 Precision:** Uses `bfloat16` for balanced precision and performance. |
|
|
|
## Advanced Training Techniques |
|
|
|
This model incorporates several advanced training techniques to enhance its capabilities: |
|
|
|
### 1. Fill-in-the-Middle (FIM) Capability |
|
|
|
FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation. |
|
|
|
#### Using FIM with the Model |
|
|
|
To use the FIM capability, structure your input with special tokens: |
|
|
|
- `<|fim_start|>`: Marks the start of the FIM input |
|
- `<|fim_marker|>`: Separates the prefix from the suffix |
|
- `<|fim_gen|>`: Indicates where the generated content should begin |
|
- `<|fim_end|>`: Marks the end of the FIM input |
|
|
|
Example FIM input: |
|
``` |
|
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|> |
|
``` |
|
|
|
The model will generate content to replace `<|fim_gen|>`, filling in the middle between the prefix and suffix. |
|
|
|
### 2. Reverse Prediction and Instruction Backtranslation |
|
|
|
This technique enhances the model's context understanding by training it to predict previous parts of a conversation or text. It's also known as instruction backtranslation. |
|
|
|
#### How it works: |
|
1. The model is given a snippet of conversation or text. |
|
2. It's then tasked with predicting what came before this snippet. |
|
3. This process helps the model understand context, conversation flow, and logical progression of ideas. |
|
|
|
#### Benefits: |
|
- Improved context understanding |
|
- Enhanced ability to maintain coherent, contextually appropriate conversations |
|
- Better grasp of cause-and-effect relationships in text |
|
|
|
#### Example use case: |
|
Input: |
|
``` |
|
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there? |
|
``` |
|
Task: Predict the previous exchange in this conversation. |
|
|
|
Possible model output: |
|
``` |
|
Human: What's the capital of France? |
|
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks. |
|
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there? |
|
``` |
|
|
|
### 3. Meta-FIM |
|
|
|
Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts. |
|
|
|
#### Benefits: |
|
- Enhanced understanding of long-range dependencies in text |
|
- Improved ability to maintain coherence across longer contexts |
|
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation |
|
|
|
#### Example: |
|
``` |
|
<|fim_start|>Human: What's the weather like today? |
|
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France. |
|
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information? |
|
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks? |
|
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones: |
|
1. Eiffel Tower |
|
2. Louvre Museum |
|
3. Notre-Dame Cathedral |
|
4. Arc de Triomphe |
|
5. Sacré-Cœur Basilica<|fim_end|> |
|
``` |
|
|
|
In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points. |
|
|
|
## Evaluation |
|
|
|
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr| |
|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------| |
|
|tinyBenchmarks | N/A| | | | | | | | |
|
| - tinyArc | 0|none | 25|acc_norm |↑ |0.5821|± | N/A| |
|
| - tinyGSM8k | 0|flexible-extract| 5|exact_match|↑ |0.4989|± | N/A| |
|
| | |strict-match | 5|exact_match|↑ |0.4867|± | N/A| |
|
| - tinyHellaswag | 0|none | 10|acc_norm |↑ |0.8307|± | N/A| |
|
| - tinyMMLU | 0|none | 0|acc_norm |↑ |0.6651|± | N/A| |
|
| - tinyTruthfulQA| 0|none | 0|acc |↑ |0.4991|± | N/A| |
|
| - tinyWinogrande| 0|none | 5|acc_norm |↑ |0.7558|± | N/A| |
|
|
|
### Training Command |
|
|
|
```bash |
|
python sft_14.py \ |
|
--run_name="llama3.1-8b-continued2" \ |
|
--model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \ |
|
--dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \ |
|
--report_to="wandb" \ |
|
--optim="adamw_bnb_8bit" \ |
|
--lr_scheduler_type="cosine" \ |
|
--max_steps=100000 \ |
|
--max_seq_length=64000 \ |
|
--learning_rate=0.00001 \ |
|
--attn_implementation="flash_attention_2" \ |
|
--save_strategy="steps" \ |
|
--save_steps 50 \ |
|
--save_total_limit=10 \ |
|
--per_device_train_batch_size=1 \ |
|
--per_device_eval_batch_size=1 \ |
|
--gradient_accumulation_steps=8 \ |
|
--logging_steps=1 \ |
|
--num_train_epochs=1 \ |
|
--push_to_hub \ |
|
--hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining2-FFT" \ |
|
--hub_strategy="all_checkpoints" \ |
|
--gradient_checkpointing \ |
|
--use_liger=true \ |
|
--packing=true \ |
|
--torch_dtype="bfloat16" \ |
|
--output_dir="continuedtraining2_output" |
|
``` |
|
|
|
## Intended Uses |
|
|
|
This model is designed for: |
|
|
|
- Text Completion and Generation |
|
- Code Completion (especially Python) |
|
- Instruction Following |
|
- General Language Understanding |
|
- Context-Aware Text Infilling (using FIM) |
|
|
|
## Limitations and Biases |
|
|
|
- The model may exhibit biases present in the training data. |
|
- It lacks real-time knowledge beyond its training data. |
|
- Should not be used for critical decision-making without human oversight. |
|
|
|
## Technical Specifications |
|
|
|
- **Base Model:** meta-llama/Meta-Llama-3.1-8B |
|
- **Training Approach:** Full Fine-Tuning |
|
- **Library:** Hugging Face Transformers and TRL |
|
|
|
## Contact |
|
|
|
For inquiries about this model, please contact Eric Florenzano through the [model repository](https://huggingface.co/ericflo/Llama-3.1-8B-ContinuedTraining2-FFT). |