---
language:
- en
tags:
- llama
- llm
- fine-tuning
- fill-in-the-middle
- instruction-following
license: apache-2.0
datasets:
- mlabonne/FineTome-100k
- mlfoundations/dclm-baseline-1.0-parquet
- wikimedia/wikipedia
- bigcode/starcoderdata
pipeline_tag: text-generation
---

# Custom LLM with Full Fine-Tuning

## Model Overview

This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version, which used a high-rank adapter, this model employs full fine-tuning for enhanced learning capacity across a variety of tasks.

- **Developer:** Eric Florenzano
- **Model Type:** Large Language Model (LLM)
- **Language(s):** English, with a focus on Python for code-related tasks
- **License:** Apache-2.0
- **Base Model:** meta-llama/Meta-Llama-3.1-8B

## Unique Training Approach

This model is trained directly on a mixture of high-quality datasets covering general text completion, code completion, and instruction following. Key features include:

- **Full Fine-Tuning:** Unlike the previous LoRA approach, this version uses full fine-tuning to update all model parameters.
- **Diverse Dataset Mixture:** Combines pretraining and instruction datasets for comprehensive language understanding.
- **Multi-Format Instruction Tuning:** Alternates between ChatML and Llama Chat templates for flexible instruction following.
- **Contextual Data Prefixing:** Uses source information to address data imbalance during training (see the sketch after this list).
- **Fill-in-the-Middle (FIM) Training:** Incorporates FIM tasks for enhanced context understanding.
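
Neither the exact source-prefix strings nor the template-selection logic are published in this card, so the following is only a minimal sketch of how contextual data prefixing and multi-format instruction tuning could be wired up; the tag strings and the `prefix_with_source` / `render_instruction` helpers are illustrative assumptions, not the actual training code.

```python
import random

# Hypothetical source tags -- the real markers used during training are not documented here.
SOURCE_TAGS = {
    "dclm": "[source: dclm-baseline]",
    "wikipedia": "[source: wikipedia]",
    "starcoder": "[source: starcoderdata]",
    "finetome": "[source: finetome-100k]",
}

def prefix_with_source(text: str, source: str) -> str:
    """Contextual data prefixing: prepend a source marker so the model can
    condition on where a sample came from, mitigating dataset imbalance."""
    return f"{SOURCE_TAGS[source]}\n{text}"

def render_instruction(messages: list[dict], template: str) -> str:
    """Multi-format instruction tuning: render the same conversation in either
    ChatML or Llama-3-style chat markup, chosen per sample."""
    if template == "chatml":
        return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)
    return "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    )

messages = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
sample = prefix_with_source(render_instruction(messages, random.choice(["chatml", "llama"])), "finetome")
```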

## Training Data

The model is trained on a blend of high-quality data sources:

- **FineTome-100k:** High-quality instruction-tuned data for general language tasks.
- **dclm-baseline-1.0-parquet:** The DataComp-LM (DCLM) baseline pretraining corpus for text completion/prediction.
- **English, Spanish, and French Wikipedia:** For broad language understanding.
- **StarCoder:** High-quality, Python-focused code data for code completion tasks.

## Training Procedure

### Setup

```bash
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn
```

## Key Features

1. **Full Fine-Tuning:** Updates all model parameters for comprehensive learning.
2. **8-bit AdamW Optimizer:** Uses `adamw_bnb_8bit` for memory-efficient training.
3. **Flash Attention 2:** Implements `flash_attention_2` for faster training.
4. **Gradient Checkpointing:** Enables training with limited GPU memory.
5. **Liger and Packing:** Utilizes `use_liger=true` and `packing=true` for efficient data handling.
6. **BFloat16 Precision:** Uses `bfloat16` for balanced precision and performance (see the loading sketch below).
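
For reference, the model-loading half of these settings corresponds to standard `transformers` arguments; the optimizer, Liger kernel, and packing options are handled by the TRL training command shown later. A minimal sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # base model used for full fine-tuning

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # BFloat16 precision
    attn_implementation="flash_attention_2",  # Flash Attention 2
)
model.gradient_checkpointing_enable()         # trade recomputation for GPU memory
```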

## Advanced Training Techniques

This model incorporates several advanced training techniques to enhance its capabilities:

### 1. Fill-in-the-Middle (FIM) Capability

FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.

#### Using FIM with the Model

To use the FIM capability, structure your input with special tokens:

- `<|fim_start|>`: Marks the start of the FIM input
- `<|fim_marker|>`: Separates the prefix from the suffix
- `<|fim_gen|>`: Indicates where the generated content should begin
- `<|fim_end|>`: Marks the end of the generated middle section

Example FIM input:
```
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>
```

The model generates the missing middle content after `<|fim_gen|>`, filling in the text between the prefix and suffix and ideally terminating with `<|fim_end|>`.
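
Below is a minimal sketch of running a FIM completion with `transformers`, assuming the markers above are written literally into the prompt (the card does not state whether they were registered as dedicated special tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ericflo/Llama-3.1-8B-ContinuedTraining3-FFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prefix = "def fibonacci(n):\n    a, b = 0, 1\n"
suffix = "\n    return a\n"
prompt = f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)

# Keep only the newly generated middle, stopping at <|fim_end|> if the model emits it.
middle = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(middle.split("<|fim_end|>")[0])
```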

### 2. Reverse Prediction and Instruction Backtranslation

This technique enhances the model's context understanding by training it to predict previous parts of a conversation or text. It's also known as instruction backtranslation.

#### How it works:
1. The model is given a snippet of conversation or text.
2. It's then tasked with predicting what came before this snippet.
3. This process helps the model understand context, conversation flow, and logical progression of ideas.

#### Benefits:
- Improved context understanding
- Enhanced ability to maintain coherent, contextually appropriate conversations
- Better grasp of cause-and-effect relationships in text

#### Example use case:
Input:
```
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
Task: Predict the previous exchange in this conversation.

Possible model output:
```
Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
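
The exact prompt format used for this objective is not published here; the sketch below shows one plausible way such a reverse-prediction training pair could be constructed (the task wording and the `make_reverse_prediction_pair` helper are assumptions):

```python
def make_reverse_prediction_pair(turns: list[str], keep_last: int = 1) -> dict:
    """Build a training pair where the model sees the tail of a conversation
    and must reconstruct what came before it."""
    visible = "\n".join(turns[-keep_last:])   # snippet shown to the model
    hidden = "\n".join(turns[:-keep_last])    # earlier context it must predict
    prompt = (
        "Below is the end of a conversation. "
        f"Write the exchange that most plausibly came before it.\n\n{visible}\n"
    )
    return {"prompt": prompt, "completion": hidden}

pair = make_reverse_prediction_pair([
    "Human: What's the capital of France?",
    "Assistant: The capital of France is Paris.",
    "Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?",
])
```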

### 3. Meta-FIM

Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.

#### Benefits:
- Enhanced understanding of long-range dependencies in text
- Improved ability to maintain coherence across longer contexts
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation

#### Example:
```
<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>
```

In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.
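
One way such (meta-)FIM samples could be produced is to cut a random middle span out of a document or conversation and rearrange it with the markers described above; the pipeline actually used may differ, so treat this as an illustrative sketch.

```python
import random

FIM_START, FIM_MARKER, FIM_GEN, FIM_END = "<|fim_start|>", "<|fim_marker|>", "<|fim_gen|>", "<|fim_end|>"

def make_fim_sample(text: str, rng: random.Random) -> str:
    """Remove a random middle span from `text` and move it to the end,
    i.e. prefix + suffix -> middle. Applied to whole conversations or
    documents, this is the Meta-FIM variant."""
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{FIM_START}{prefix}{FIM_MARKER}{suffix}{FIM_GEN}{middle}{FIM_END}"

rng = random.Random(0)
print(make_fim_sample("Human: What's the weather like today?\nAssistant: I don't have live data.", rng))
```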

## Evaluation

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks | N/A| | | | | | | |
| - tinyArc | 0|none | 25|acc_norm |↑ |0.5791|± | N/A|
| - tinyGSM8k | 0|flexible-extract| 5|exact_match|↑ |0.5536|± | N/A|
| | |strict-match | 5|exact_match|↑ |0.5536|± | N/A|
| - tinyHellaswag | 0|none | 10|acc_norm |↑ |0.8391|± | N/A|
| - tinyMMLU | 0|none | 0|acc_norm |↑ |0.6377|± | N/A|
| - tinyTruthfulQA| 0|none | 0|acc |↑ |0.4914|± | N/A|
| - tinyWinogrande| 0|none | 5|acc_norm |↑ |0.7608|± | N/A|
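
These scores are in lm-evaluation-harness output format. Assuming the standard `tinyBenchmarks` task group in that harness, a comparable run looks roughly like the following (exact task and argument names may vary by harness version):

```python
# pip install lm-eval   (EleutherAI lm-evaluation-harness)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ericflo/Llama-3.1-8B-ContinuedTraining3-FFT,dtype=bfloat16",
    tasks=["tinyBenchmarks"],
)
print(results["results"])
```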

### Training Command

```bash
python sft_14.py \
  --run_name="llama3.1-8b-continued3" \
  --model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
  --dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
  --report_to="wandb" \
  --optim="adamw_bnb_8bit" \
  --lr_scheduler_type="cosine" \
  --max_steps=100000 \
  --max_seq_length=64000 \
  --learning_rate=0.00001 \
  --attn_implementation="flash_attention_2" \
  --save_strategy="steps" \
  --save_steps 50 \
  --save_total_limit=10 \
  --per_device_train_batch_size=1 \
  --per_device_eval_batch_size=1 \
  --gradient_accumulation_steps=8 \
  --logging_steps=1 \
  --num_train_epochs=1 \
  --push_to_hub \
  --hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
  --hub_strategy="all_checkpoints" \
  --gradient_checkpointing \
  --use_liger=true \
  --packing=true \
  --torch_dtype="bfloat16" \
  --output_dir="continuedtraining3_output"
```

## Intended Uses

This model is designed for:

- Text Completion and Generation (see the usage example below)
- Code Completion (especially Python)
- Instruction Following
- General Language Understanding
- Context-Aware Text Infilling (using FIM)
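
A minimal text-generation example with `transformers` (repository id taken from the training command above):

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(generator("def fizzbuzz(n):", max_new_tokens=80)[0]["generated_text"])
```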

## Limitations and Biases

- The model may exhibit biases present in the training data.
- It lacks real-time knowledge beyond its training data.
- It should not be used for critical decision-making without human oversight.

## Technical Specifications

- **Base Model:** meta-llama/Meta-Llama-3.1-8B
- **Training Approach:** Full Fine-Tuning
- **Library:** Hugging Face Transformers and TRL

## Contact

For inquiries about this model, please contact Eric Florenzano through the [model repository](https://huggingface.co/ericflo/Llama-3.1-8B-ContinuedTraining3-FFT).