---
language:
- en
tags:
- llama
- llm
- fine-tuning
- fill-in-the-middle
- instruction-following
license: apache-2.0
datasets:
- mlabonne/FineTome-100k
- mlfoundations/dclm-baseline-1.0-parquet
- wikimedia/wikipedia
- bigcode/starcoderdata
pipeline_tag: text-generation
---

# Custom LLM with Full Fine-Tuning

## Model Overview

This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version, which used a high-rank adapter, this model employs full fine-tuning for enhanced learning capacity across a variety of tasks.

- **Developer:** Eric Florenzano
- **Model Type:** Large Language Model (LLM)
- **Language(s):** English, with a focus on Python for code-related tasks
- **License:** Apache-2.0
- **Base Model:** meta-llama/Meta-Llama-3.1-8B

## Unique Training Approach

This model is trained directly on a mixture of high-quality datasets covering general text completion, code completion, and instruction following. Key features include:

- **Full Fine-Tuning:** Unlike the previous LoRA approach, this version uses full fine-tuning to update all model parameters.
- **Diverse Dataset Mixture:** Combines pretraining and instruction datasets for comprehensive language understanding.
- **Multi-Format Instruction Tuning:** Alternates between ChatML and Llama Chat templates for flexible instruction following.
- **Contextual Data Prefixing:** Uses source information to address data imbalance during training (see the sketch after this list).
- **Fill-in-the-Middle (FIM) Training:** Incorporates FIM tasks for enhanced context understanding.
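
Neither the exact source-prefix strings nor the template-selection logic are published in this card, so the following is only a minimal sketch of how contextual data prefixing and multi-format instruction tuning could be wired up; the tag strings and the `prefix_with_source` / `render_instruction` helpers are illustrative assumptions, not the actual training code.

```python
import random

# Hypothetical source tags -- the real markers used during training are not documented here.
SOURCE_TAGS = {
    "dclm": "[source: dclm-baseline]",
    "wikipedia": "[source: wikipedia]",
    "starcoder": "[source: starcoderdata]",
    "finetome": "[source: finetome-100k]",
}

def prefix_with_source(text: str, source: str) -> str:
    """Contextual data prefixing: prepend a source marker so the model can
    condition on where a sample came from, mitigating dataset imbalance."""
    return f"{SOURCE_TAGS[source]}\n{text}"

def render_instruction(messages: list[dict], template: str) -> str:
    """Multi-format instruction tuning: render the same conversation in either
    ChatML or Llama-3-style chat markup, chosen per sample."""
    if template == "chatml":
        return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)
    return "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    )

messages = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
sample = prefix_with_source(render_instruction(messages, random.choice(["chatml", "llama"])), "finetome")
```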

## Training Data

The model is trained on a blend of high-quality data sources:

- **FineTome-100k:** High-quality instruction-tuned data for general language tasks.
- **dclm-baseline-1.0-parquet:** The DataComp-LM (DCLM) baseline pretraining corpus for text completion/prediction.
- **English, Spanish, and French Wikipedia:** For broad language understanding.
- **StarCoder:** High-quality, Python-focused code data for code completion tasks.

## Training Procedure

### Setup

```bash
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn
```

## Key Features

1. **Full Fine-Tuning:** Updates all model parameters for comprehensive learning.
2. **8-bit AdamW Optimizer:** Uses `adamw_bnb_8bit` for memory-efficient training.
3. **Flash Attention 2:** Implements `flash_attention_2` for faster training.
4. **Gradient Checkpointing:** Enables training with limited GPU memory.
5. **Liger and Packing:** Utilizes `use_liger=true` and `packing=true` for efficient data handling.
6. **BFloat16 Precision:** Uses `bfloat16` for balanced precision and performance (see the loading sketch below).
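
For reference, the model-loading half of these settings corresponds to standard `transformers` arguments; the optimizer, Liger kernel, and packing options are handled by the TRL training command shown later. A minimal sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # base model used for full fine-tuning

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # BFloat16 precision
    attn_implementation="flash_attention_2",  # Flash Attention 2
)
model.gradient_checkpointing_enable()         # trade recomputation for GPU memory
```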

## Advanced Training Techniques

This model incorporates several advanced training techniques to enhance its capabilities:

### 1. Fill-in-the-Middle (FIM) Capability

FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.

#### Using FIM with the Model

To use the FIM capability, structure your input with special tokens:

- `<|fim_start|>`: Marks the start of the FIM input
- `<|fim_marker|>`: Separates the prefix from the suffix
- `<|fim_gen|>`: Indicates where the generated content should begin
- `<|fim_end|>`: Marks the end of the generated middle section

Example FIM input:
```
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>
```

The model generates the missing middle content after `<|fim_gen|>`, filling in the text between the prefix and suffix and ideally terminating with `<|fim_end|>`.
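
Below is a minimal sketch of running a FIM completion with `transformers`, assuming the markers above are written literally into the prompt (the card does not state whether they were registered as dedicated special tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ericflo/Llama-3.1-8B-ContinuedTraining3-FFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prefix = "def fibonacci(n):\n    a, b = 0, 1\n"
suffix = "\n    return a\n"
prompt = f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)

# Keep only the newly generated middle, stopping at <|fim_end|> if the model emits it.
middle = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(middle.split("<|fim_end|>")[0])
```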

### 2. Reverse Prediction and Instruction Backtranslation

This technique enhances the model's context understanding by training it to predict previous parts of a conversation or text. It's also known as instruction backtranslation.

#### How it works:
1. The model is given a snippet of conversation or text.
2. It's then tasked with predicting what came before this snippet.
3. This process helps the model understand context, conversation flow, and logical progression of ideas.

#### Benefits:
- Improved context understanding
- Enhanced ability to maintain coherent, contextually appropriate conversations
- Better grasp of cause-and-effect relationships in text

#### Example use case:
Input:
```
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
Task: Predict the previous exchange in this conversation.

Possible model output:
```
Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
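
The exact prompt format used for this objective is not published here; the sketch below shows one plausible way such a reverse-prediction training pair could be constructed (the task wording and the `make_reverse_prediction_pair` helper are assumptions):

```python
def make_reverse_prediction_pair(turns: list[str], keep_last: int = 1) -> dict:
    """Build a training pair where the model sees the tail of a conversation
    and must reconstruct what came before it."""
    visible = "\n".join(turns[-keep_last:])   # snippet shown to the model
    hidden = "\n".join(turns[:-keep_last])    # earlier context it must predict
    prompt = (
        "Below is the end of a conversation. "
        f"Write the exchange that most plausibly came before it.\n\n{visible}\n"
    )
    return {"prompt": prompt, "completion": hidden}

pair = make_reverse_prediction_pair([
    "Human: What's the capital of France?",
    "Assistant: The capital of France is Paris.",
    "Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?",
])
```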

### 3. Meta-FIM

Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.

#### Benefits:
- Enhanced understanding of long-range dependencies in text
- Improved ability to maintain coherence across longer contexts
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation

#### Example:
```
<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>
```

In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.
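
One way such (meta-)FIM samples could be produced is to cut a random middle span out of a document or conversation and rearrange it with the markers described above; the pipeline actually used may differ, so treat this as an illustrative sketch.

```python
import random

FIM_START, FIM_MARKER, FIM_GEN, FIM_END = "<|fim_start|>", "<|fim_marker|>", "<|fim_gen|>", "<|fim_end|>"

def make_fim_sample(text: str, rng: random.Random) -> str:
    """Remove a random middle span from `text` and move it to the end,
    i.e. prefix + suffix -> middle. Applied to whole conversations or
    documents, this is the Meta-FIM variant."""
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{FIM_START}{prefix}{FIM_MARKER}{suffix}{FIM_GEN}{middle}{FIM_END}"

rng = random.Random(0)
print(make_fim_sample("Human: What's the weather like today?\nAssistant: I don't have live data.", rng))
```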

## Evaluation

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks | N/A| | | | | | | |
| - tinyArc | 0|none | 25|acc_norm |↑ |0.5791|± | N/A|
| - tinyGSM8k | 0|flexible-extract| 5|exact_match|↑ |0.5536|± | N/A|
| | |strict-match | 5|exact_match|↑ |0.5536|± | N/A|
| - tinyHellaswag | 0|none | 10|acc_norm |↑ |0.8391|± | N/A|
| - tinyMMLU | 0|none | 0|acc_norm |↑ |0.6377|± | N/A|
| - tinyTruthfulQA| 0|none | 0|acc |↑ |0.4914|± | N/A|
| - tinyWinogrande| 0|none | 5|acc_norm |↑ |0.7608|± | N/A|
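
These scores are in lm-evaluation-harness output format. Assuming the standard `tinyBenchmarks` task group in that harness, a comparable run looks roughly like the following (exact task and argument names may vary by harness version):

```python
# pip install lm-eval   (EleutherAI lm-evaluation-harness)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ericflo/Llama-3.1-8B-ContinuedTraining3-FFT,dtype=bfloat16",
    tasks=["tinyBenchmarks"],
)
print(results["results"])
```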

### Training Command

```bash
python sft_14.py \
  --run_name="llama3.1-8b-continued3" \
  --model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
  --dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
  --report_to="wandb" \
  --optim="adamw_bnb_8bit" \
  --lr_scheduler_type="cosine" \
  --max_steps=100000 \
  --max_seq_length=64000 \
  --learning_rate=0.00001 \
  --attn_implementation="flash_attention_2" \
  --save_strategy="steps" \
  --save_steps 50 \
  --save_total_limit=10 \
  --per_device_train_batch_size=1 \
  --per_device_eval_batch_size=1 \
  --gradient_accumulation_steps=8 \
  --logging_steps=1 \
  --num_train_epochs=1 \
  --push_to_hub \
  --hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
  --hub_strategy="all_checkpoints" \
  --gradient_checkpointing \
  --use_liger=true \
  --packing=true \
  --torch_dtype="bfloat16" \
  --output_dir="continuedtraining3_output"
```

## Intended Uses

This model is designed for:

- Text Completion and Generation (see the usage example below)
- Code Completion (especially Python)
- Instruction Following
- General Language Understanding
- Context-Aware Text Infilling (using FIM)
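
A minimal text-generation example with `transformers` (repository id taken from the training command above):

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(generator("def fizzbuzz(n):", max_new_tokens=80)[0]["generated_text"])
```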

## Limitations and Biases

- The model may exhibit biases present in the training data.
- It lacks real-time knowledge beyond its training data.
- It should not be used for critical decision-making without human oversight.

## Technical Specifications

- **Base Model:** meta-llama/Meta-Llama-3.1-8B
- **Training Approach:** Full Fine-Tuning
- **Library:** Hugging Face Transformers and TRL

## Contact

For inquiries about this model, please contact Eric Florenzano through the [model repository](https://huggingface.co/ericflo/Llama-3.1-8B-ContinuedTraining3-FFT).