---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
base_model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
---
# Uploaded model
- **Developed by:** Asuncom
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
```python
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```
```python
!pip install --upgrade pip
```
```python
!pip install --no-deps "xformers<0.0.26" "trl<0.9.0" peft accelerate bitsandbytes
```
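Before loading the model, it can be worth confirming that a GPU is visible and that the pinned packages resolved to compatible versions. A minimal, optional check (the exact versions will depend on when the install cells above are run):
```python
# Optional sanity check: verify GPU visibility and key library versions.
import torch, transformers, peft, trl

print("CUDA available:", torch.cuda.is_available())
print("torch", torch.__version__,
      "| transformers", transformers.__version__,
      "| peft", peft.__version__,
      "| trl", trl.__version__)
```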
```python
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
```
```python
# ========================================================
# Test before training
# ========================================================
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "请把现代汉语翻译成古文", # instruction
        "其品行廉正,所以至死也不放松对自己的要求。", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```
```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
```
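Since `get_peft_model` attaches LoRA adapters on top of the frozen 4-bit base weights, only a small fraction of the parameters should be trainable. A quick check (the `print_trainable_parameters` helper is assumed to be available on the returned PEFT model; the manual count works regardless):
```python
# Only the LoRA adapter weights should require gradients; the 4-bit base stays frozen.
model.print_trainable_parameters()

# Equivalent manual count:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```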
```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass
from datasets import load_dataset
dataset = load_dataset("Asuncom/shiji-qishiliezhuan", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
```
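Printing one formatted record is an easy way to confirm the Alpaca template and the EOS token were applied as intended (the column names below are taken from the mapping function above):
```python
# Sanity check: inspect the first formatted training example.
print(dataset.column_names)  # should include "instruction", "input", "output", "text"
print(dataset[0]["text"])    # full prompt + response, ending with the EOS token
```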
```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
```
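For reference, on a single GPU the arguments above give an effective batch of 8 sequences per optimizer step, so the 100-step run sees roughly 800 training examples. A quick back-of-the-envelope check:
```python
# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps (single GPU).
effective_batch = 2 * 4                  # = 8 sequences per optimizer step
samples_seen    = effective_batch * 100  # max_steps = 100 -> ~800 examples
print(f"effective batch: {effective_batch}, samples seen: ~{samples_seen}")
```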
```python
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```
```python
import wandb
# Initialize a W&B run in offline mode
wandb.init(mode="offline", project="asuncom", entity="asuncom")
```
```python
trainer_stats = trainer.train()
```
```python
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
```
```python
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "请把现代汉语翻译成古文", # instruction
        "其品行廉正,所以至死也不放松对自己的要求。", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```
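If you want the generated translation as a Python string rather than streamed to stdout, a minimal variant of the same call:
```python
# Capture the output instead of streaming it.
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])
```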
```python
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", token = "hf_huggingface的密钥NeKb") # Online saving
tokenizer.push_to_hub("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", token = "hf_huggingface的密钥saving
```
```python
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!
inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```
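If Unsloth is not available at inference time, the pushed repository is a plain PEFT LoRA adapter on top of the 4-bit base model, so it should also load with `peft`/`transformers` directly. A sketch, assuming the adapter repo name used above:
```python
# Load the LoRA adapter with plain PEFT/transformers (no Unsloth required).
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "Asuncom/Llama-3.1-8B-bnb-4bit-shiji",  # adapter repo pushed above
    load_in_4bit = True,
)
tokenizer = AutoTokenizer.from_pretrained("Asuncom/Llama-3.1-8B-bnb-4bit-shiji")
```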
```python
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, save_method = "merged_16bit", token = "hf_...")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, save_method = "merged_4bit", token = "hf_...")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, save_method = "lora", token = "hf_...")
```
```python
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, token = "")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, quantization_method = "f16", token = "")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("Asuncom/Llama-3.1-8B-bnb-4bit-shiji", tokenizer, quantization_method = "q4_k_m", token = "hf_xxxxx")
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "Asuncom/Llama-3.1-8B-bnb-4bit-shiji", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "hf_...", # Get a token at https://huggingface.co/settings/tokens
    )
```
```python
model.push_to_hub_gguf(
    "Asuncom/Llama-3.1-8B-bnb-4bit-shiji", # Change hf to your username!
    tokenizer,
    quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
    token = "hf_...", # Get a token at https://huggingface.co/settings/tokens
)
```
```
[ 279/ 292] blk.30.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 280/ 292] blk.30.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 281/ 292] blk.30.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 282/ 292] blk.31.ffn_gate.weight - [ 4096, 14336, 1, 1], type = f16, converting to q5_K .. size = 112.00 MiB -> 38.50 MiB
[ 283/ 292] blk.31.ffn_up.weight - [ 4096, 14336, 1, 1], type = f16, converting to q5_K .. size = 112.00 MiB -> 38.50 MiB
[ 284/ 292] blk.31.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q5_K .. size = 8.00 MiB -> 2.75 MiB
[ 285/ 292] blk.31.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 286/ 292] blk.31.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q5_K .. size = 32.00 MiB -> 11.00 MiB
[ 287/ 292] blk.31.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q6_K .. size = 8.00 MiB -> 3.28 MiB
[ 288/ 292] output.weight - [ 4096, 128256, 1, 1], type = f16, converting to q6_K .. size = 1002.00 MiB -> 410.98 MiB
[ 289/ 292] blk.31.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 290/ 292] blk.31.ffn_down.weight - [14336, 4096, 1, 1], type = f16, converting to q6_K .. size = 112.00 MiB -> 45.94 MiB
[ 291/ 292] blk.31.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 292/ 292] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
llama_model_quantize_internal: model size = 15317.02 MB
llama_model_quantize_internal: quant size = 5459.93 MB
main: quantize time = 147401.53 ms
main: total time = 147401.53 ms
Unsloth: Conversion completed! Output location: ./Asuncom/Llama-3.1-8B-bnb-4bit-shiji/unsloth.Q5_K_M.gguf
Unsloth: Uploading GGUF to Huggingface Hub...
unsloth.F16.gguf: 100%|██████████| 16.1G/16.1G [26:20<00:00, 10.2MB/s]
Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
Unsloth: Uploading GGUF to Huggingface Hub...
unsloth.Q4_K_M.gguf: 100%|██████████| 4.92G/4.92G [08:05<00:00, 10.1MB/s]
Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
Unsloth: Uploading GGUF to Huggingface Hub...
unsloth.Q8_0.gguf: 100%|██████████| 8.54G/8.54G [13:48<00:00, 10.3MB/s]
Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
Unsloth: Uploading GGUF to Huggingface Hub...
unsloth.Q5_K_M.gguf: 100%|██████████| 5.73G/5.73G [09:24<00:00, 10.2MB/s]
Saved GGUF to https://huggingface.co/Asuncom/Llama-3.1-8B-bnb-4bit-shiji
```
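The uploaded GGUF files can also be used outside of this notebook, for example with `llama-cpp-python`, which can pull a quantized file straight from the Hub. A sketch, assuming the `unsloth.Q4_K_M.gguf` filename shown in the upload log and the same Alpaca-style prompt used during training:
```python
# Run the exported GGUF with llama-cpp-python (pip install llama-cpp-python huggingface_hub).
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id = "Asuncom/Llama-3.1-8B-bnb-4bit-shiji",
    filename = "unsloth.Q4_K_M.gguf",  # filename as shown in the upload log above
    n_ctx = 2048,
)

prompt = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n请把现代汉语翻译成古文\n\n"
    "### Input:\n其品行廉正,所以至死也不放松对自己的要求。\n\n"
    "### Response:\n"
)
out = llm(prompt, max_tokens = 128)
print(out["choices"][0]["text"])
```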