Submitting job: /common/home/users/d/dh.huang.2023/code/logical-reasoning/scripts/tune-mgtv.sh
Current Directory: /common/home/users/d/dh.huang.2023/code/logical-reasoning
Tue Jul 16 09:15:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40 On | 00000000:01:00.0 Off | 0 |
| N/A 29C P8 35W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Linux lexicon 4.18.0-553.5.1.el8_10.x86_64 #1 SMP Thu Jun 6 09:41:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
NAME="Rocky Linux"
VERSION="8.10 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 2450.000
CPU max MHz: 3529.0520
CPU min MHz: 1500.0000
BogoMIPS: 4890.67
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-127
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
MemTotal: 527669560 kB
Tuning shenzhi-wang/Llama3-8B-Chinese-Chat with config/llama3-8b_lora_sft_bf16-p1.yaml
Current Directory:
/common/home/users/d/dh.huang.2023/code/logical-reasoning/llama-factory
config/llama3-8b_lora_sft_bf16-p1.yaml: {
  "model_name_or_path": "shenzhi-wang/Llama3-8B-Chinese-Chat",
  "stage": "sft",
  "do_train": true,
  "finetuning_type": "lora",
  "lora_target": "all",
  "loraplus_lr_ratio": 16.0,
  "upcast_layernorm": true,
  "dataset": "alpaca_mgtv_p1",
  "template": "llama3",
  "cutoff_len": 4096,
  "max_samples": 25000,
  "overwrite_cache": true,
  "preprocessing_num_workers": 16,
  "output_dir": "saves/llama3-8b/lora/sft_bf16_p1_full",
  "logging_steps": 10,
  "save_steps": 175,
  "plot_loss": true,
  "overwrite_output_dir": true,
  "per_device_train_batch_size": 16,
  "gradient_accumulation_steps": 8,
  "learning_rate": 0.0001,
  "num_train_epochs": 6.0,
  "lr_scheduler_type": "cosine",
  "warmup_ratio": 0.1,
  "bf16": true,
  "ddp_timeout": 180000000,
  "val_size": 0.1,
  "per_device_eval_batch_size": 1,
  "eval_strategy": "steps",
  "eval_steps": 175,
  "report_to": "wandb",
  "run_name": "llama3_8b_p1_full"
}
07/16/2024 09:15:26 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2108] 2024-07-16 09:15:26,639 >> loading file tokenizer.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-07-16 09:15:26,639 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-07-16 09:15:26,639 >> loading file special_tokens_map.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-07-16 09:15:26,639 >> loading file tokenizer_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/tokenizer_config.json
[WARNING|logging.py:314] 2024-07-16 09:15:27,174 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/16/2024 09:15:27 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
07/16/2024 09:15:27 - INFO - llamafactory.data.loader - Loading dataset alpaca_mgtv_p1.json...
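For reference, the run shape and learning-rate schedule implied by this config can be sanity-checked offline. The sketch below is standalone Python (not part of the training job) and assumes the single-GPU setup shown by nvidia-smi above; the values it prints match the figures the trainer reports later in this log (22,500/2,500 train/eval split, total batch size 128, 1,050 optimization steps, and the first logged learning_rate values).

# Sanity-check of the run shape and LR schedule implied by the config above.
# Standalone arithmetic only; printed values match the trainer banner and the
# learning_rate entries logged further down.
import math

cfg = {
    "max_samples": 25_000, "val_size": 0.1,
    "per_device_train_batch_size": 16, "gradient_accumulation_steps": 8,
    "num_train_epochs": 6, "learning_rate": 1e-4, "warmup_ratio": 0.1,
}
n_gpus = 1  # single NVIDIA L40, per the nvidia-smi output above

eval_n = int(cfg["max_samples"] * cfg["val_size"])   # 2500 eval examples
train_n = cfg["max_samples"] - eval_n                # 22500 train examples
global_batch = cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"] * n_gpus  # 128

# Update steps are counted as (dataloader batches per epoch) // (accumulation steps).
batches_per_epoch = math.ceil(train_n / cfg["per_device_train_batch_size"])  # 1407
steps_per_epoch = batches_per_epoch // cfg["gradient_accumulation_steps"]    # 175
total_steps = steps_per_epoch * cfg["num_train_epochs"]                      # 1050
warmup_steps = int(cfg["warmup_ratio"] * total_steps)                        # 105

def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay (the scheduler named in the config)."""
    if step < warmup_steps:
        return cfg["learning_rate"] * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return cfg["learning_rate"] * 0.5 * (1.0 + math.cos(math.pi * progress))

print(train_n, eval_n, global_batch, total_steps)  # 22500 2500 128 1050
print(f"{lr_at(10):.3e}  {lr_at(110):.3e}")        # ~9.524e-06 and ~9.999e-05

Since save_steps and eval_steps are both 175, checkpoints and evaluations land exactly once per epoch under this arithmetic.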
Converting format of dataset (num_proc=16): 0%| | 0/25000 [00:00> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json [INFO|configuration_utils.py:796] 2024-07-16 09:15:32,903 >> Model config LlamaConfig { "_name_or_path": "shenzhi-wang/Llama3-8B-Chinese-Chat", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 } [INFO|modeling_utils.py:3474] 2024-07-16 09:15:33,043 >> loading weights file model.safetensors from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/model.safetensors.index.json [INFO|modeling_utils.py:1519] 2024-07-16 09:15:33,051 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [INFO|configuration_utils.py:962] 2024-07-16 09:15:33,052 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128009 } input_ids: [128000, 128006, 882, 128007, 271, 57668, 122503, 11589, 119, 48039, 105745, 9554, 36668, 69978, 17792, 1811, 105745, 75486, 47548, 121589, 49543, 16, 13, 111521, 58318, 30046, 38093, 113885, 48044, 39013, 250, 34972, 9174, 17, 13, 111521, 58318, 30046, 74770, 68438, 29172, 57107, 37507, 47012, 44368, 52084, 3922, 16175, 251, 42421, 50338, 30867, 39013, 250, 34972, 9174, 18, 13, 70262, 35304, 74257, 19483, 87219, 3922, 36668, 69978, 17792, 45163, 110747, 115827, 106041, 113925, 88852, 76208, 19483, 31867, 48982, 114554, 5232, 21043, 5486, 103668, 5486, 16937, 107693, 5486, 113925, 90091, 5486, 57107, 25333, 33200, 9174, 19, 13, 115228, 103899, 16325, 54253, 43955, 109545, 42246, 103282, 28469, 104587, 54253, 66870, 105838, 31867, 48982, 105363, 109545, 48044, 19113, 1811, 78657, 102836, 74770, 102178, 2118, 103668, 863, 66870, 105838, 13153, 2118, 16937, 863, 9174, 20, 13, 111521, 58318, 30046, 86206, 110747, 113925, 37507, 84851, 22649, 91495, 32335, 105866, 93233, 20834, 39013, 250, 34972, 9554, 90091, 126427, 3490, 15225, 109759, 35083, 117026, 108787, 75486, 47548, 113925, 124080, 30046, 29172, 114606, 87219, 3490, 39013, 250, 34972, 25, 74662, 7518, 226, 46729, 101634, 70349, 106258, 48044, 102491, 92877, 9554, 42783, 37687, 5232, 74257, 8107, 59563, 124743, 113928, 51109, 9554, 105343, 56602, 3922, 59563, 124743, 101129, 70349, 60843, 19361, 48044, 32335, 108199, 59563, 124743, 38093, 16937, 119184, 69636, 107163, 3922, 101634, 70821, 80578, 33764, 33091, 47551, 47523, 110482, 115461, 16937, 50338, 1811, 15225, 93233, 20834, 59563, 124743, 21388, 111930, 103, 105068, 121964, 113385, 3490, 115827, 106041, 25, 114794, 50021, 127314, 58318, 15120, 25129, 8107, 10287, 230, 9554, 107019, 101908, 113713, 110477, 25129, 107019, 101908, 8107, 106196, 13646, 3922, 106638, 58318, 15120, 25129, 58666, 107921, 9554, 118230, 107571, 50021, 111943, 1811, 104563, 95337, 23187, 19000, 59563, 124743, 113928, 51109, 
9554, 105343, 56602, 37985, 106367, 1811, 116764, 3922, 51609, 91940, 119760, 17792, 3922, 118230, 107571, 19000, 106367, 107591, 25580, 105067, 83324, 37689, 48915, 16325, 104787, 101083, 1811, 116292, 108803, 9554, 107019, 101908, 109002, 108142, 104611, 64209, 103312, 9554, 118230, 107571, 116255, 8107, 127579, 45163, 32335, 108199, 59563, 124743, 113150, 102149, 3922, 54322, 28037, 118230, 107571, 9554, 124652, 25580, 105610, 33091, 108385, 105508, 107924, 20648, 222, 91763, 110477, 15120, 116405, 109337, 105196, 35287, 43240, 8107, 3922, 13153, 109002, 108209, 101634, 70349, 48044, 101365, 109243, 9554, 42783, 37687, 3490, 124080, 30046, 29172, 114606, 87219, 25, 108837, 115, 104123, 22023, 101365, 103054, 198, 128009, 128006, 78191, 128007, 271, 103668, 128009] inputs: <|begin_of_text|><|start_header_id|>user<|end_header_id|> 你是一个逻辑游戏的主持人。游戏规则如下: 1. 参与者会得到一个谜题。 2. 参与者可以通过提问来获取线索,尝试解开谜题。 3. 对于每个问题,主持人将根据实际情况回答以下五个选项之一:是、不是、不重要、回答正确、问法错误。 4. 回答中不能添加任何其它信息,也不能省略选项中的任何一个字。例如,不可以把“不是”省略成“不”。 5. 参与者需要根据回答来推理,并最终找出谜题的正确答案。 请严格按照这些规则回答参与者提出的问题。 谜题: 在甄家村里,有一个古老的传说:每年南瓜丰收的季节,南瓜田里总有一个最大的南瓜会不翼而飞,村民们对此现象困惑不解。请找出南瓜失踪背后的原因。 实际情况: 真相原来与一位年迈的农夫有关。这位农夫年轻时,曾与一位美丽的姑娘相恋。他们约定在南瓜丰收的季节结婚。然而,命运弄人,姑娘在婚礼前的一场意外中离世。悲伤的农夫为了纪念心爱的姑娘,每年都会将最大的南瓜偷走,放到姑娘的墓前,以此寄托自己的哀思。这一行为延续了多年,成为了乡村里一个神秘的传说。 参与者提出的问题: 偷的人信神吗 <|eot_id|><|start_header_id|>assistant<|end_header_id|> 不是<|eot_id|> label_ids: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 103668, 128009] labels: 不是<|eot_id|> Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing LlamaForCausalLM. [INFO|modeling_utils.py:4288] 2024-07-16 09:15:50,224 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at shenzhi-wang/Llama3-8B-Chinese-Chat. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. [INFO|configuration_utils.py:917] 2024-07-16 09:15:50,463 >> loading configuration file generation_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/generation_config.json [INFO|configuration_utils.py:962] 2024-07-16 09:15:50,463 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128009, "pad_token_id": 128009 } 07/16/2024 09:15:50 - INFO - llamafactory.model.model_utils.checkpointing - Upcasting layernorm weights in float32. 07/16/2024 09:15:50 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled. 07/16/2024 09:15:50 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference. 07/16/2024 09:15:50 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32. 07/16/2024 09:15:50 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA 07/16/2024 09:15:50 - INFO - llamafactory.model.model_utils.misc - Found linear modules: down_proj,up_proj,gate_proj,k_proj,v_proj,o_proj,q_proj 07/16/2024 09:15:50 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605 Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [INFO|trainer.py:641] 2024-07-16 09:15:51,000 >> Using auto half precision backend 07/16/2024 09:15:51 - WARNING - llamafactory.train.callbacks - Previous trainer log in this folder will be deleted. 07/16/2024 09:15:51 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00. [INFO|trainer.py:2078] 2024-07-16 09:15:51,222 >> ***** Running training ***** [INFO|trainer.py:2079] 2024-07-16 09:15:51,222 >> Num examples = 22,500 [INFO|trainer.py:2080] 2024-07-16 09:15:51,222 >> Num Epochs = 6 [INFO|trainer.py:2081] 2024-07-16 09:15:51,222 >> Instantaneous batch size per device = 16 [INFO|trainer.py:2084] 2024-07-16 09:15:51,222 >> Total train batch size (w. parallel, distributed & accumulation) = 128 [INFO|trainer.py:2085] 2024-07-16 09:15:51,222 >> Gradient Accumulation steps = 8 [INFO|trainer.py:2086] 2024-07-16 09:15:51,222 >> Total optimization steps = 1,050 [INFO|trainer.py:2087] 2024-07-16 09:15:51,225 >> Number of trainable parameters = 20,971,520 [INFO|integration_utils.py:723] 2024-07-16 09:15:51,228 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: inflaton-sg (inflaton-ai). 
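A note on the label_ids dump above: every prompt position is set to -100 and only the assistant answer (不是 plus <|eot_id|>) keeps its token ids, because -100 is the ignore index of the cross-entropy loss, so only the answer tokens contribute to training (the causal-LM model shifts labels internally for next-token prediction). A minimal, self-contained illustration with hypothetical tensors, not taken from the run:

# Positions labeled -100 are ignored by the loss; only the answer is supervised.
import torch
import torch.nn.functional as F

vocab = 128256
logits = torch.randn(6, vocab)  # 6 positions of fake next-token logits
labels = torch.tensor([-100, -100, -100, -100, 103668, 128009])  # prompt masked, answer kept

loss = F.cross_entropy(logits, labels, ignore_index=-100)
# Identical to computing the loss over the last two (answer) positions only:
loss_answer_only = F.cross_entropy(logits[4:], labels[4:])
assert torch.allclose(loss, loss_answer_only)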
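The "trainable params: 20,971,520" figure above is consistent with rank-8 LoRA adapters attached to all seven linear projections of each of the 32 decoder layers; rank 8 is LLaMA-Factory's default and is an assumption here, since the config does not set lora_rank explicitly. A back-of-the-envelope check:

# Back-of-the-envelope check of the logged trainable parameter count.
# Assumption: lora_rank = 8 (LLaMA-Factory default; not set in the config above).
hidden, intermediate = 4096, 14336   # from the LlamaConfig printed above
kv_dim = hidden // (32 // 8)         # 8 KV heads vs 32 attention heads -> 1024
rank, layers = 8, 32

# LoRA adds two matrices per target module, A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) extra parameters per module.
modules = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}
per_layer = sum(rank * (d_in + d_out) for d_in, d_out in modules.values())
print(per_layer * layers)  # -> 20971520, matching the logged trainable params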
Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.4
wandb: Run data is saved locally in /common2/dh.huang.2023/code/logical-reasoning/llama-factory/wandb/run-20240716_091552-t1nqedbn
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run llama3_8b_p1_full
wandb: ⭐️ View project at https://wandb.ai/inflaton-ai/huggingface
wandb: 🚀 View run at https://wandb.ai/inflaton-ai/huggingface/runs/t1nqedbn
0%| | 0/1050 [00:00<?, ?it/s]
[INFO|trainer.py:3719] 2024-07-16 11:19:02,509 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-16 11:19:02,509 >> Num examples = 2500
[INFO|trainer.py:3724] 2024-07-16 11:19:02,509 >> Batch size = 1
{'loss': 0.9166, 'grad_norm': 1.478421688079834, 'learning_rate': 9.523809523809523e-06, 'epoch': 0.06}
{'loss': 0.3687, 'grad_norm': 0.9236305356025696, 'learning_rate': 1.9047619047619046e-05, 'epoch': 0.11}
{'loss': 0.328, 'grad_norm': 0.6159961223602295, 'learning_rate': 2.857142857142857e-05, 'epoch': 0.17}
{'loss': 0.3049, 'grad_norm': 1.8057916164398193, 'learning_rate': 3.809523809523809e-05, 'epoch': 0.23}
{'loss': 0.2999, 'grad_norm': 1.5811758041381836, 'learning_rate': 4.761904761904762e-05, 'epoch': 0.28}
{'loss': 0.2951, 'grad_norm': 1.9880269765853882, 'learning_rate': 5.714285714285714e-05, 'epoch': 0.34}
{'loss': 0.282, 'grad_norm': 1.5736329555511475, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.4}
{'loss': 0.2805, 'grad_norm': 1.3553781509399414, 'learning_rate': 7.619047619047618e-05, 'epoch': 0.45}
{'loss': 0.2717, 'grad_norm': 1.3975526094436646, 'learning_rate': 8.571428571428571e-05, 'epoch': 0.51}
{'loss': 0.2659, 'grad_norm': 1.0928113460540771, 'learning_rate': 9.523809523809524e-05, 'epoch': 0.57}
{'loss': 0.27, 'grad_norm': 1.163079857826233, 'learning_rate': 9.999309273455528e-05, 'epoch': 0.63}
{'loss': 0.2671, 'grad_norm': 0.849062979221344, 'learning_rate': 9.993784606094612e-05, 'epoch': 0.68}
{'loss': 0.2869, 'grad_norm': 0.8933653831481934, 'learning_rate': 9.982741376606078e-05, 'epoch': 0.74}
{'loss': 0.2997, 'grad_norm': 1.046842098236084, 'learning_rate': 9.966191788709716e-05, 'epoch': 0.8}
{'loss': 0.2735, 'grad_norm': 2.4692931175231934, 'learning_rate': 9.944154131125642e-05, 'epoch': 0.85}
{'loss': 0.2905, 'grad_norm': 1.4990626573562622, 'learning_rate': 9.916652757363698e-05, 'epoch': 0.91}
{'loss': 0.258, 'grad_norm': 1.4853521585464478, 'learning_rate': 9.883718058810707e-05, 'epoch': 0.97}
0%| | 0/2500 [00:00<?, ?it/s]
>> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-175
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-16 11:22:29,531 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-16 11:22:29,532 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 }
[INFO|tokenization_utils_base.py:2513] 2024-07-16 11:22:29,811 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-175/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-16 11:22:29,813 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-175/special_tokens_map.json
17%|█▋ | 176/1050 [2:07:05<25:03:40, 103.23s/it]
[... tqdm progress collapsed: steps 177-349 at roughly 42 s/step ...]
33%|███▎ | 350/1050 [4:09:10<8:14:50, 42.41s/it][INFO|trainer.py:3719] 2024-07-16
13:25:13,583 >> ***** Running Evaluation ***** [INFO|trainer.py:3721] 2024-07-16 13:25:13,583 >> Num examples = 2500 [INFO|trainer.py:3724] 2024-07-16 13:25:13,584 >> Batch size = 1 {'eval_loss': 0.24650488793849945, 'eval_accuracy': 0.909, 'eval_runtime': 206.2691, 'eval_samples_per_second': 12.12, 'eval_steps_per_second': 12.12, 'epoch': 1.0} {'loss': 0.2471, 'grad_norm': 0.8761231303215027, 'learning_rate': 9.84538643114539e-05, 'epoch': 1.02} {'loss': 0.2304, 'grad_norm': 1.037913203239441, 'learning_rate': 9.801700234117999e-05, 'epoch': 1.08} {'loss': 0.2414, 'grad_norm': 0.5963209867477417, 'learning_rate': 9.752707744739145e-05, 'epoch': 1.14} {'loss': 0.2276, 'grad_norm': 0.8125211596488953, 'learning_rate': 9.698463103929542e-05, 'epoch': 1.19} {'loss': 0.2137, 'grad_norm': 2.1267828941345215, 'learning_rate': 9.639026256689628e-05, 'epoch': 1.25} {'loss': 0.2467, 'grad_norm': 3.7273290157318115, 'learning_rate': 9.574462885855174e-05, 'epoch': 1.31} {'loss': 0.2682, 'grad_norm': 4.215674877166748, 'learning_rate': 9.504844339512095e-05, 'epoch': 1.36} {'loss': 0.29, 'grad_norm': 1.7884002923965454, 'learning_rate': 9.430247552150673e-05, 'epoch': 1.42} {'loss': 0.2252, 'grad_norm': 1.312519907951355, 'learning_rate': 9.350754959646306e-05, 'epoch': 1.48} {'loss': 0.2408, 'grad_norm': 2.8921542167663574, 'learning_rate': 9.266454408160779e-05, 'epoch': 1.54} {'loss': 0.2171, 'grad_norm': 2.1106109619140625, 'learning_rate': 9.177439057064683e-05, 'epoch': 1.59} {'loss': 0.2394, 'grad_norm': 1.2954068183898926, 'learning_rate': 9.083807275988284e-05, 'epoch': 1.65} {'loss': 0.2307, 'grad_norm': 3.02253794670105, 'learning_rate': 8.985662536114613e-05, 'epoch': 1.71} {'loss': 0.2172, 'grad_norm': 1.063110589981079, 'learning_rate': 8.883113295834892e-05, 'epoch': 1.76} {'loss': 0.2251, 'grad_norm': 1.7286932468414307, 'learning_rate': 8.776272880892675e-05, 'epoch': 1.82} {'loss': 0.2435, 'grad_norm': 1.194043755531311, 'learning_rate': 8.665259359149132e-05, 'epoch': 1.88} {'loss': 0.2425, 'grad_norm': 0.8665173649787903, 'learning_rate': 8.550195410107902e-05, 'epoch': 1.93} {'loss': 0.2297, 'grad_norm': 1.0586192607879639, 'learning_rate': 8.43120818934367e-05, 'epoch': 1.99} 0%| | 0/2500 [00:00> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-350 /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-16 13:28:40,491 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-16 13:28:40,491 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 }
[INFO|tokenization_utils_base.py:2513] 2024-07-16 13:28:40,678 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-350/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-16 13:28:40,680 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-350/special_tokens_map.json
33%|███▎ | 351/1050 [4:13:21<20:22:38, 104.95s/it]
[... tqdm progress collapsed: steps 352-520 at roughly 42 s/step ...]
49%|████▉ | 521/1050 [6:12:52<6:13:15, 42.34s/it]
50%|████▉ | 522/1050 [6:13:35<6:13:31, 42.45s/it] 50%|████▉ | 523/1050 [6:14:16<6:10:36, 42.20s/it] 50%|████▉ | 524/1050 [6:14:59<6:10:44, 42.29s/it] 50%|█████ | 525/1050 [6:15:41<6:09:20, 42.21s/it][INFO|trainer.py:3719] 2024-07-16 15:31:44,177 >> ***** Running Evaluation ***** [INFO|trainer.py:3721] 2024-07-16 15:31:44,177 >> Num examples = 2500 [INFO|trainer.py:3724] 2024-07-16 15:31:44,177 >> Batch size = 1 {'eval_loss': 0.24609853327274323, 'eval_accuracy': 0.9031333333333333, 'eval_runtime': 206.1828, 'eval_samples_per_second': 12.125, 'eval_steps_per_second': 12.125, 'epoch': 1.99} {'loss': 0.1928, 'grad_norm': 1.062510371208191, 'learning_rate': 8.308429187984297e-05, 'epoch': 2.05} {'loss': 0.194, 'grad_norm': 1.1897845268249512, 'learning_rate': 8.181994087401819e-05, 'epoch': 2.1} {'loss': 0.1699, 'grad_norm': 1.3862345218658447, 'learning_rate': 8.052042609272817e-05, 'epoch': 2.16} {'loss': 0.186, 'grad_norm': 0.9099957346916199, 'learning_rate': 7.91871836117395e-05, 'epoch': 2.22} {'loss': 0.1711, 'grad_norm': 1.0390052795410156, 'learning_rate': 7.782168677883206e-05, 'epoch': 2.27} {'loss': 0.2012, 'grad_norm': 2.652061700820923, 'learning_rate': 7.642544458562278e-05, 'epoch': 2.33} {'loss': 0.1898, 'grad_norm': 1.5696091651916504, 'learning_rate': 7.500000000000001e-05, 'epoch': 2.39} {'loss': 0.2032, 'grad_norm': 0.8261299729347229, 'learning_rate': 7.354692826101102e-05, 'epoch': 2.44} {'loss': 0.1854, 'grad_norm': 1.252537727355957, 'learning_rate': 7.20678351380872e-05, 'epoch': 2.5} {'loss': 0.202, 'grad_norm': 1.3897773027420044, 'learning_rate': 7.056435515653059e-05, 'epoch': 2.56} {'loss': 0.1789, 'grad_norm': 0.8310902714729309, 'learning_rate': 6.903814979122249e-05, 'epoch': 2.62} {'loss': 0.1969, 'grad_norm': 1.3531970977783203, 'learning_rate': 6.749090563055076e-05, 'epoch': 2.67} {'loss': 0.1967, 'grad_norm': 1.0367170572280884, 'learning_rate': 6.592433251258423e-05, 'epoch': 2.73} {'loss': 0.1825, 'grad_norm': 1.2085515260696411, 'learning_rate': 6.434016163555452e-05, 'epoch': 2.79} {'loss': 0.212, 'grad_norm': 1.2390882968902588, 'learning_rate': 6.274014364473274e-05, 'epoch': 2.84} {'loss': 0.191, 'grad_norm': 1.4294568300247192, 'learning_rate': 6.112604669781572e-05, 'epoch': 2.9} {'loss': 0.1792, 'grad_norm': 1.522834300994873, 'learning_rate': 5.949965451095951e-05, 'epoch': 2.96} 0%| | 0/2500 [00:00> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-525 /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-16 15:35:11,769 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-16 15:35:11,770 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 }
[INFO|tokenization_utils_base.py:2513] 2024-07-16 15:35:11,956 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-525/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-16 15:35:11,958 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-525/special_tokens_map.json
50%|█████ | 526/1050 [6:19:53<15:17:37, 105.07s/it]
[... tqdm progress collapsed: steps 527-689 at roughly 42 s/step ...]
66%|██████▌ | 690/1050 [8:15:12<4:13:23, 42.23s/it] 66%|██████▌ | 690/1050
[8:15:12<4:13:23, 42.23s/it] 66%|██████▌ | 691/1050 [8:15:54<4:12:15, 42.16s/it] 66%|██████▌ | 692/1050 [8:16:36<4:10:57, 42.06s/it] 66%|██████▌ | 693/1050 [8:17:18<4:09:52, 41.99s/it] 66%|██████▌ | 694/1050 [8:18:00<4:09:57, 42.13s/it] 66%|██████▌ | 695/1050 [8:18:42<4:08:41, 42.03s/it] 66%|██████▋ | 696/1050 [8:19:24<4:07:44, 41.99s/it] 66%|██████▋ | 697/1050 [8:20:07<4:08:56, 42.31s/it] 66%|██████▋ | 698/1050 [8:20:48<4:07:07, 42.12s/it] 67%|██████▋ | 699/1050 [8:21:30<4:05:08, 41.90s/it] 67%|██████▋ | 700/1050 [8:22:12<4:05:38, 42.11s/it] 67%|██████▋ | 700/1050 [8:22:12<4:05:38, 42.11s/it][INFO|trainer.py:3719] 2024-07-16 17:38:15,569 >> ***** Running Evaluation ***** [INFO|trainer.py:3721] 2024-07-16 17:38:15,569 >> Num examples = 2500 [INFO|trainer.py:3724] 2024-07-16 17:38:15,569 >> Batch size = 1 {'eval_loss': 0.25212058424949646, 'eval_accuracy': 0.9013333333333332, 'eval_runtime': 206.7127, 'eval_samples_per_second': 12.094, 'eval_steps_per_second': 12.094, 'epoch': 2.99} {'loss': 0.1592, 'grad_norm': 0.655222475528717, 'learning_rate': 5.786276438761927e-05, 'epoch': 3.01} {'loss': 0.1186, 'grad_norm': 0.8747724294662476, 'learning_rate': 5.621718523237427e-05, 'epoch': 3.07} {'loss': 0.1195, 'grad_norm': 1.3602584600448608, 'learning_rate': 5.456473555193242e-05, 'epoch': 3.13} {'loss': 0.144, 'grad_norm': 1.0027141571044922, 'learning_rate': 5.290724144552379e-05, 'epoch': 3.18} {'loss': 0.1508, 'grad_norm': 2.7845115661621094, 'learning_rate': 5.124653458690365e-05, 'epoch': 3.24} {'loss': 0.1399, 'grad_norm': 0.7762941718101501, 'learning_rate': 4.9584450200195156e-05, 'epoch': 3.3} {'loss': 0.1321, 'grad_norm': 0.9395729303359985, 'learning_rate': 4.792282503180867e-05, 'epoch': 3.35} {'loss': 0.1245, 'grad_norm': 0.9491863250732422, 'learning_rate': 4.626349532067879e-05, 'epoch': 3.41} {'loss': 0.1211, 'grad_norm': 1.0270098447799683, 'learning_rate': 4.4608294769062075e-05, 'epoch': 3.47} {'loss': 0.1336, 'grad_norm': 1.0683083534240723, 'learning_rate': 4.295905251613817e-05, 'epoch': 3.53} {'loss': 0.1379, 'grad_norm': 1.0028445720672607, 'learning_rate': 4.131759111665349e-05, 'epoch': 3.58} {'loss': 0.1248, 'grad_norm': 0.6078014969825745, 'learning_rate': 3.968572452684113e-05, 'epoch': 3.64} {'loss': 0.1244, 'grad_norm': 1.0846917629241943, 'learning_rate': 3.806525609984312e-05, 'epoch': 3.7} {'loss': 0.1455, 'grad_norm': 1.1056978702545166, 'learning_rate': 3.6457976592849754e-05, 'epoch': 3.75} {'loss': 0.1381, 'grad_norm': 0.8466194868087769, 'learning_rate': 3.486566218815871e-05, 'epoch': 3.81} {'loss': 0.1278, 'grad_norm': 1.3114445209503174, 'learning_rate': 3.329007253034063e-05, 'epoch': 3.87} {'loss': 0.1487, 'grad_norm': 1.2287284135818481, 'learning_rate': 3.173294878168025e-05, 'epoch': 3.92} {'loss': 0.122, 'grad_norm': 0.7370882630348206, 'learning_rate': 3.019601169804216e-05, 'epoch': 3.98} 0%| | 0/2500 [00:00> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-700 /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-16 17:41:43,905 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-16 17:41:43,906 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 }
[INFO|tokenization_utils_base.py:2513] 2024-07-16 17:41:44,086 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-700/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-16 17:41:44,087 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-700/special_tokens_map.json
67%|██████▋ | 701/1050 [8:26:24<10:10:30, 104.96s/it] … 70%|███████ | 740/1050
[8:53:46<3:37:25, 42.08s/it] … 82%|████████▏ | 860/1050 [10:18:12<2:15:03,
42.65s/it] 82%|████████▏ | 861/1050 [10:18:54<2:14:25, 42.68s/it] 82%|████████▏ | 862/1050 [10:19:36<2:13:17, 42.54s/it] 82%|████████▏ | 863/1050 [10:20:17<2:11:09, 42.08s/it] 82%|████████▏ | 864/1050 [10:21:01<2:11:42, 42.49s/it] 82%|████████▏ | 865/1050 [10:21:43<2:11:04, 42.51s/it] 82%|████████▏ | 866/1050 [10:22:25<2:09:01, 42.07s/it] 83%|████████▎ | 867/1050 [10:23:08<2:09:22, 42.42s/it] 83%|████████▎ | 868/1050 [10:23:50<2:08:12, 42.27s/it] 83%|████████▎ | 869/1050 [10:24:31<2:06:25, 41.91s/it] 83%|████████▎ | 870/1050 [10:25:13<2:06:02, 42.01s/it] 83%|████████▎ | 870/1050 [10:25:13<2:06:02, 42.01s/it] 83%|████████▎ | 871/1050 [10:25:55<2:05:35, 42.10s/it] 83%|████████▎ | 872/1050 [10:26:38<2:05:16, 42.23s/it] 83%|████████▎ | 873/1050 [10:27:19<2:03:45, 41.95s/it] 83%|████████▎ | 874/1050 [10:28:01<2:02:59, 41.93s/it] 83%|████████▎ | 875/1050 [10:28:43<2:02:07, 41.87s/it][INFO|trainer.py:3719] 2024-07-16 19:44:45,913 >> ***** Running Evaluation ***** [INFO|trainer.py:3721] 2024-07-16 19:44:45,913 >> Num examples = 2500 [INFO|trainer.py:3724] 2024-07-16 19:44:45,913 >> Batch size = 1 {'eval_loss': 0.27484622597694397, 'eval_accuracy': 0.9049666666666666, 'eval_runtime': 207.5459, 'eval_samples_per_second': 12.046, 'eval_steps_per_second': 12.046, 'epoch': 3.98} {'loss': 0.1074, 'grad_norm': 0.7140550017356873, 'learning_rate': 2.8680959727287317e-05, 'epoch': 4.04} {'loss': 0.0856, 'grad_norm': 1.042784333229065, 'learning_rate': 2.718946713234185e-05, 'epoch': 4.09} {'loss': 0.0837, 'grad_norm': 0.7461434602737427, 'learning_rate': 2.5723182140992387e-05, 'epoch': 4.15} {'loss': 0.0749, 'grad_norm': 1.004974603652954, 'learning_rate': 2.428372512445233e-05, 'epoch': 4.21} {'loss': 0.0835, 'grad_norm': 0.6601787805557251, 'learning_rate': 2.2872686806712035e-05, 'epoch': 4.26} {'loss': 0.0685, 'grad_norm': 0.6331136226654053, 'learning_rate': 2.1491626506651914e-05, 'epoch': 4.32} {'loss': 0.0787, 'grad_norm': 0.8726811408996582, 'learning_rate': 2.0142070414860704e-05, 'epoch': 4.38} {'loss': 0.0783, 'grad_norm': 0.8191830515861511, 'learning_rate': 1.8825509907063327e-05, 'epoch': 4.43} {'loss': 0.0759, 'grad_norm': 0.6537315249443054, 'learning_rate': 1.7543399896022405e-05, 'epoch': 4.49} {'loss': 0.0952, 'grad_norm': 1.3562214374542236, 'learning_rate': 1.629715722373423e-05, 'epoch': 4.55} {'loss': 0.0916, 'grad_norm': 1.0490374565124512, 'learning_rate': 1.5088159095696363e-05, 'epoch': 4.61} {'loss': 0.0874, 'grad_norm': 0.8466753363609314, 'learning_rate': 1.3917741558976894e-05, 'epoch': 4.66} {'loss': 0.0879, 'grad_norm': 0.9645017385482788, 'learning_rate': 1.2787198025767416e-05, 'epoch': 4.72} {'loss': 0.0697, 'grad_norm': 3.8121776580810547, 'learning_rate': 1.1697777844051105e-05, 'epoch': 4.78} {'loss': 0.0873, 'grad_norm': 1.1398985385894775, 'learning_rate': 1.0650684916965559e-05, 'epoch': 4.83} {'loss': 0.0826, 'grad_norm': 1.19906485080719, 'learning_rate': 9.647076372386194e-06, 'epoch': 4.89} {'loss': 0.073, 'grad_norm': 0.7644526958465576, 'learning_rate': 8.688061284200266e-06, 'epoch': 4.95} 0%| | 0/2500 [00:00> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-875 /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-16 19:48:13,061 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-16 19:48:13,062 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 }
[INFO|tokenization_utils_base.py:2513] 2024-07-16 19:48:13,241 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-875/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-16 19:48:13,242 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-875/special_tokens_map.json
83%|████████▎ | 876/1050 [10:32:52<5:02:10, 104.20s/it] … 87%|████████▋ |
912/1050 [10:58:11<1:37:26, 42.37s/it] … 98%|█████████▊ |
1031/1050 [12:22:05<13:24, 42.36s/it] 98%|█████████▊| 1032/1050 [12:22:47<12:43, 42.41s/it] 98%|█████████▊| 1033/1050 [12:23:29<11:57, 42.20s/it] 98%|█████████▊| 1034/1050 [12:24:12<11:17, 42.34s/it] 99%|█████████▊| 1035/1050 [12:24:54<10:35, 42.36s/it] 99%|█████████▊| 1036/1050 [12:25:36<09:49, 42.10s/it] 99%|█████████▉| 1037/1050 [12:26:18<09:09, 42.24s/it] 99%|█████████▉| 1038/1050 [12:27:00<08:25, 42.13s/it] 99%|█████████▉| 1039/1050 [12:27:42<07:41, 41.97s/it] 99%|█████████▉| 1040/1050 [12:28:23<06:58, 41.88s/it] 99%|█████████▉| 1040/1050 [12:28:23<06:58, 41.88s/it] 99%|█████████▉| 1041/1050 [12:29:06<06:19, 42.22s/it] 99%|█████████▉| 1042/1050 [12:29:50<05:40, 42.56s/it] 99%|█████████▉| 1043/1050 [12:30:32<04:58, 42.59s/it] 99%|█████████▉| 1044/1050 [12:31:15<04:14, 42.50s/it] 100%|█████████▉| 1045/1050 [12:31:58<03:33, 42.71s/it] 100%|█████████▉| 1046/1050 [12:32:41<02:51, 42.91s/it] 100%|█████████▉| 1047/1050 [12:33:23<02:07, 42.67s/it] 100%|█████████▉| 1048/1050 [12:34:05<01:24, 42.49s/it] 100%|█████████▉| 1049/1050 [12:34:48<00:42, 42.53s/it] 100%|██████████| 1050/1050 [12:35:30<00:00, 42.34s/it] 100%|██████████| 1050/1050 [12:35:30<00:00, 42.34s/it][INFO|trainer.py:3719] 2024-07-16 21:51:33,138 >> ***** Running Evaluation ***** [INFO|trainer.py:3721] 2024-07-16 21:51:33,138 >> Num examples = 2500 [INFO|trainer.py:3724] 2024-07-16 21:51:33,138 >> Batch size = 1 {'eval_loss': 0.33086976408958435, 'eval_accuracy': 0.9083, 'eval_runtime': 206.4018, 'eval_samples_per_second': 12.112, 'eval_steps_per_second': 12.112, 'epoch': 4.98} {'loss': 0.082, 'grad_norm': 0.686549723148346, 'learning_rate': 7.774699446684608e-06, 'epoch': 5.0} {'loss': 0.0516, 'grad_norm': 0.5072228312492371, 'learning_rate': 6.908000203341802e-06, 'epoch': 5.06} {'loss': 0.0607, 'grad_norm': 0.5079275965690613, 'learning_rate': 6.088921331488568e-06, 'epoch': 5.12} {'loss': 0.0451, 'grad_norm': 0.8684201240539551, 'learning_rate': 5.318367983829392e-06, 'epoch': 5.17} {'loss': 0.0516, 'grad_norm': 0.822227954864502, 'learning_rate': 4.597191688184754e-06, 'epoch': 5.23} {'loss': 0.0479, 'grad_norm': 0.5135931968688965, 'learning_rate': 3.9261894064796135e-06, 'epoch': 5.29} {'loss': 0.0547, 'grad_norm': 0.5866013765335083, 'learning_rate': 3.306102654031823e-06, 'epoch': 5.34} {'loss': 0.0555, 'grad_norm': 1.5747359991073608, 'learning_rate': 2.737616680113758e-06, 'epoch': 5.4} {'loss': 0.0495, 'grad_norm': 2.3203916549682617, 'learning_rate': 2.221359710692961e-06, 'epoch': 5.46} {'loss': 0.0522, 'grad_norm': 0.6227638721466064, 'learning_rate': 1.757902254188254e-06, 'epoch': 5.52} {'loss': 0.0536, 'grad_norm': 1.4704910516738892, 'learning_rate': 1.3477564710088098e-06, 'epoch': 5.57} {'loss': 0.0518, 'grad_norm': 0.4436222314834595, 'learning_rate': 9.913756075728087e-07, 'epoch': 5.63} {'loss': 0.0491, 'grad_norm': 0.6575887799263, 'learning_rate': 6.891534954310885e-07, 'epoch': 5.69} {'loss': 0.0355, 'grad_norm': 0.5118672251701355, 'learning_rate': 4.4142411604936597e-07, 'epoch': 5.74} {'loss': 0.0425, 'grad_norm': 0.8050460815429688, 'learning_rate': 2.4846123172992954e-07, 'epoch': 5.8} {'loss': 0.0553, 'grad_norm': 0.6702028512954712, 'learning_rate': 1.1047808308075058e-07, 'epoch': 5.86} {'loss': 0.061, 'grad_norm': 0.4666765630245209, 'learning_rate': 2.7627153366222013e-08, 'epoch': 5.91} {'loss': 0.0484, 'grad_norm': 0.5390927195549011, 'learning_rate': 0.0, 'epoch': 5.97} 0%| | 0/2500 [00:00> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-1050 
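The learning-rate values in the step logs above fall from about 7.8e-06 at the start of epoch 5 to exactly 0.0 at the final step, which is what the configured cosine schedule with 10% warmup predicts. The sketch below is only a sanity check, not part of the training code: it assumes the standard transformers cosine-with-warmup schedule, and the step indices are approximate because losses are logged every 10 optimizer steps (roughly 175.8 steps per epoch). Note also that the logged value is the base adapter learning rate; with loraplus_lr_ratio 16.0 the LoRA B matrices are trained at 16x this value.

```python
# Sketch reproducing the logged learning-rate decay, assuming the standard
# transformers cosine schedule with warmup (warmup_ratio 0.1 of 1,050 steps
# => 105 warmup steps, peak lr 1e-4).
import math

PEAK_LR, TOTAL_STEPS = 1e-4, 1050
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)   # 105

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(880))    # ~7.77e-06 (logged 7.7747e-06 at epoch 5.0)
print(lr_at(1050))   # 0.0       (logged 0.0 at the last step, epoch 5.97)
```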
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn( [INFO|configuration_utils.py:733] 2024-07-16 21:55:00,319 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json [INFO|configuration_utils.py:796] 2024-07-16 21:55:00,320 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 } [INFO|tokenization_utils_base.py:2513] 2024-07-16 21:55:00,504 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-1050/tokenizer_config.json [INFO|tokenization_utils_base.py:2522] 2024-07-16 21:55:00,506 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p1_full/checkpoint-1050/special_tokens_map.json [INFO|trainer.py:2329] 2024-07-16 21:55:00,991 >> Training completed. Do not forget to share your model on huggingface.co/models =) 100%|██████████| 1050/1050 [12:38:58<00:00, 42.34s/it] 100%|██████████| 1050/1050 [12:38:58<00:00, 43.37s/it] [INFO|trainer.py:3410] 2024-07-16 21:55:00,996 >> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p1_full /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn( [INFO|configuration_utils.py:733] 2024-07-16 21:55:01,515 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json [INFO|configuration_utils.py:796] 2024-07-16 21:55:01,515 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 } [INFO|tokenization_utils_base.py:2513] 2024-07-16 21:55:01,694 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p1_full/tokenizer_config.json [INFO|tokenization_utils_base.py:2522] 2024-07-16 21:55:01,695 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p1_full/special_tokens_map.json [INFO|trainer.py:3719] 2024-07-16 21:55:02,127 >> ***** Running Evaluation ***** [INFO|trainer.py:3721] 2024-07-16 21:55:02,127 >> Num examples = 2500 [INFO|trainer.py:3724] 2024-07-16 21:55:02,128 >> Batch size = 1 {'eval_loss': 0.43139833211898804, 'eval_accuracy': 0.9057333333333334, 'eval_runtime': 206.2708, 'eval_samples_per_second': 12.12, 'eval_steps_per_second': 12.12, 'epoch': 5.97} {'train_runtime': 45549.7669, 'train_samples_per_second': 2.964, 'train_steps_per_second': 0.023, 'train_loss': 0.1699013292221796, 'epoch': 5.97} ***** train metrics ***** epoch = 5.9701 total_flos = 2841612535GF train_loss = 0.1699 train_runtime = 12:39:09.76 train_samples_per_second = 2.964 train_steps_per_second = 0.023 Figure saved at: saves/llama3-8b/lora/sft_bf16_p1_full/training_loss.png Figure saved at: saves/llama3-8b/lora/sft_bf16_p1_full/training_eval_loss.png Figure saved at: saves/llama3-8b/lora/sft_bf16_p1_full/training_eval_accuracy.png 0%| | 0/2500 [00:00> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.9057333333333334}]} ***** eval metrics ***** epoch = 5.9701 eval_accuracy = 0.9057 eval_loss = 0.4314 eval_runtime = 0:03:27.44 eval_samples_per_second = 12.051 eval_steps_per_second = 12.051 wandb: - 0.014 MB of 0.014 MB uploaded wandb: \ 0.023 MB of 0.063 MB uploaded wandb: | 0.023 MB of 0.063 MB uploaded wandb: / 0.063 MB of 0.063 MB uploaded wandb: wandb: Run history: wandb: eval/accuracy █▃▁▄▇▅▅ wandb: eval/loss ▁▁▁▂▄██ wandb: eval/runtime ▁▁▄█▂▁▇ wandb: eval/samples_per_second ██▅▁▇█▁ wandb: eval/steps_per_second ██▅▁▇█▁ wandb: train/epoch ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████ wandb: train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████ wandb: train/grad_norm ▄▁▅▄▃▃▄▃▆▅██▃▃▄▇▂▂▃▄▂▇▂▃▃▂▂▂▂▂▃▂▃▁▂▁▁▁▂▁ wandb: train/learning_rate ▂▃▅▇██████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁ wandb: train/loss █▃▃▃▃▃▃▃▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ wandb: wandb: Run summary: wandb: eval/accuracy 0.90573 wandb: eval/loss 0.4314 wandb: eval/runtime 207.4441 wandb: eval/samples_per_second 12.051 wandb: eval/steps_per_second 12.051 wandb: total_flos 
3.0511582264790876e+18 wandb: train/epoch 5.97015 wandb: train/global_step 1050 wandb: train/grad_norm 0.53909 wandb: train/learning_rate 0.0 wandb: train/loss 0.0484 wandb: train_loss 0.1699 wandb: train_runtime 45549.7669 wandb: train_samples_per_second 2.964 wandb: train_steps_per_second 0.023 wandb: wandb: 🚀 View run llama3_8b_p1_full at: https://wandb.ai/inflaton-ai/huggingface/runs/t1nqedbn wandb: ⭐️ View project at: https://wandb.ai/inflaton-ai/huggingface wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) wandb: Find logs at: ./wandb/run-20240716_091552-t1nqedbn/logs wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information. Tuning shenzhi-wang/Llama3-8B-Chinese-Chat with config/llama3-8b_lora_sft_bf16-p2.yaml Current Directory: /common/home/users/d/dh.huang.2023/code/logical-reasoning/llama-factory config/llama3-8b_lora_sft_bf16-p2.yaml: { "model_name_or_path": "shenzhi-wang/Llama3-8B-Chinese-Chat", "stage": "sft", "do_train": true, "finetuning_type": "lora", "lora_target": "all", "loraplus_lr_ratio": 16.0, "upcast_layernorm": true, "dataset": "alpaca_mgtv_p2", "template": "llama3", "cutoff_len": 4096, "max_samples": 25000, "overwrite_cache": true, "preprocessing_num_workers": 16, "output_dir": "saves/llama3-8b/lora/sft_bf16_p2_full", "logging_steps": 10, "save_steps": 175, "plot_loss": true, "per_device_train_batch_size": 16, "gradient_accumulation_steps": 8, "learning_rate": 0.0001, "num_train_epochs": 6.0, "lr_scheduler_type": "cosine", "warmup_ratio": 0.1, "bf16": true, "ddp_timeout": 180000000, "val_size": 0.1, "per_device_eval_batch_size": 1, "eval_strategy": "steps", "eval_steps": 175, "report_to": "wandb", "run_name": "llama3_8b_p2_full" } 07/16/2024 21:58:48 - INFO - llamafactory.hparams.parser - Resuming training from saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-525. 07/16/2024 21:58:48 - INFO - llamafactory.hparams.parser - Change `output_dir` or use `overwrite_output_dir` to avoid. 07/16/2024 21:58:48 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16 [INFO|tokenization_utils_base.py:2108] 2024-07-16 21:59:02,101 >> loading file tokenizer.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/tokenizer.json [INFO|tokenization_utils_base.py:2108] 2024-07-16 21:59:02,101 >> loading file added_tokens.json from cache at None [INFO|tokenization_utils_base.py:2108] 2024-07-16 21:59:02,101 >> loading file special_tokens_map.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/special_tokens_map.json [INFO|tokenization_utils_base.py:2108] 2024-07-16 21:59:02,101 >> loading file tokenizer_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/tokenizer_config.json [WARNING|logging.py:314] 2024-07-16 21:59:02,314 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
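Before the p2 run starts, note why the trainer resumes: the output directory saves/llama3-8b/lora/sft_bf16_p2_full already contains checkpoint-525, and this YAML does not set overwrite_output_dir, so LLaMA-Factory picks up the last checkpoint, exactly as the parser messages above say. The step counts implied by the config can be verified with a few lines of arithmetic; this is a rough sanity-check sketch under the stated rounding assumption, not part of the training code:

```python
# Sanity check of the schedule implied by the YAML above. Inputs taken from
# the config: max_samples 25,000, val_size 0.1, per_device_train_batch_size 16,
# gradient_accumulation_steps 8, num_train_epochs 6, resume from checkpoint-525.
train_examples = int(25_000 * (1 - 0.1))             # 22,500 train / 2,500 eval
effective_batch = 16 * 8                             # 128 sequences per optimizer step
steps_per_epoch = train_examples // effective_batch  # 175 (the Trainer floors this)
total_steps = 6 * steps_per_epoch                    # 1,050 = "Total optimization steps"
resume_epoch = 525 // steps_per_epoch                # 3 -> "Continuing training from epoch 3"

# Throughput of the p1 run that just finished, for comparison:
# 6 * 22,500 samples / 45,549.77 s ~= 2.96 samples/s (logged 2.964).
print(steps_per_epoch, total_steps, resume_epoch)
```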
07/16/2024 21:59:02 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|> 07/16/2024 21:59:02 - INFO - llamafactory.data.loader - Loading dataset alpaca_mgtv_p2.json... Converting format of dataset (num_proc=16): 0%| | 0/25000 [00:00<|start_header_id|>user<|end_header_id|> 你是一个情景猜谜游戏的主持人。游戏规则如下: 1. 参与者会得到一个谜面,谜面会描述一个简单又难以理解的事件。 2. 主持人知道谜底,谜底是谜面的答案。 3. 参与者可以询问任何封闭式问题来找寻事件的真相。 4. 对于每个问题,主持人将根据实际情况回答以下五个选项之一:是、不是、不重要、回答正确、问法错误。各回答的判断标准如下: - 若谜面和谜底能找到问题的答案,回答:是或者不是 - 若谜面和谜底不能直接或者间接推断出问题的答案,回答:不重要 - 若参与者提问不是一个封闭式问题或者问题难以理解,回答:问法错误 - 若参与者提问基本还原了谜底真相,回答:回答正确 5. 回答中不能添加任何其它信息,也不能省略选项中的任何一个字。例如,不可以把“不是”省略成“不”。 请严格按照这些规则回答参与者提出的问题。 **谜面:** 在甄家村里,有一个古老的传说:每年南瓜丰收的季节,南瓜田里总有一个最大的南瓜会不翼而飞,村民们对此现象困惑不解。请找出南瓜失踪背后的原因。 **谜底:** 真相原来与一位年迈的农夫有关。这位农夫年轻时,曾与一位美丽的姑娘相恋。他们约定在南瓜丰收的季节结婚。然而,命运弄人,姑娘在婚礼前的一场意外中离世。悲伤的农夫为了纪念心爱的姑娘,每年都会将最大的南瓜偷走,放到姑娘的墓前,以此寄托自己的哀思。这一行为延续了多年,成为了乡村里一个神秘的传说。 **参与者提出的问题:** 偷的人信神吗 <|eot_id|><|start_header_id|>assistant<|end_header_id|> 不是<|eot_id|> /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn( [INFO|configuration_utils.py:733] 2024-07-16 21:59:09,431 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json [INFO|configuration_utils.py:796] 2024-07-16 21:59:09,432 >> Model config LlamaConfig { "_name_or_path": "shenzhi-wang/Llama3-8B-Chinese-Chat", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 } [INFO|modeling_utils.py:3474] 2024-07-16 21:59:09,516 >> loading weights file model.safetensors from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/model.safetensors.index.json [INFO|modeling_utils.py:1519] 2024-07-16 21:59:09,517 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. 
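The record above is one rendered training example from alpaca_mgtv_p2: a Chinese situation-puzzle host prompt whose gold answer is simply 不是 ("No"). The label_ids dump printed next shows how the example is supervised: every prompt token carries the ignore index -100, which cross-entropy skips, so the loss (and likewise the token-level eval_accuracy reported at each evaluation step) only counts positions whose label is not -100, here the tokens for 不是 plus <|eot_id|>. A minimal, illustrative sketch of that masking follows; the helper is hypothetical, not LLaMA-Factory's actual code:

```python
# Illustrative sketch of SFT label masking (hypothetical helper): prompt tokens
# get label -100, the index torch.nn.CrossEntropyLoss ignores, so only the
# answer tokens are supervised.
IGNORE_INDEX = -100

def build_example(prompt_ids: list[int], answer_ids: list[int]):
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels

# In the dump below, the long run of -100s covers the whole rendered prompt,
# and the two trailing label ids (103668, 128009) are the tokens for
# "不是" and "<|eot_id|>" -- exactly the decoded `labels` string shown.
```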
[INFO|configuration_utils.py:962] 2024-07-16 21:59:09,518 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128009 }
label_ids: [-100, -100, -100, …, -100, 103668, 128009]
labels: 不是<|eot_id|>
Loading checkpoint shards: 0/4 → 4/4
>> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4288] 2024-07-16 21:59:18,456 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at shenzhi-wang/Llama3-8B-Chinese-Chat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. [INFO|configuration_utils.py:917] 2024-07-16 21:59:18,692 >> loading configuration file generation_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/generation_config.json [INFO|configuration_utils.py:962] 2024-07-16 21:59:18,693 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128009, "pad_token_id": 128009 } 07/16/2024 21:59:18 - INFO - llamafactory.model.model_utils.checkpointing - Upcasting layernorm weights in float32. 07/16/2024 21:59:18 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled. 07/16/2024 21:59:18 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference. 07/16/2024 21:59:18 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32. 07/16/2024 21:59:18 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA 07/16/2024 21:59:18 - INFO - llamafactory.model.model_utils.misc - Found linear modules: up_proj,v_proj,o_proj,gate_proj,k_proj,down_proj,q_proj 07/16/2024 21:59:19 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605 Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [INFO|trainer.py:641] 2024-07-16 21:59:19,179 >> Using auto half precision backend [INFO|trainer.py:2461] 2024-07-16 21:59:19,181 >> Loading model from saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-525. 07/16/2024 21:59:19 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00. [INFO|trainer.py:2078] 2024-07-16 21:59:20,044 >> ***** Running training ***** [INFO|trainer.py:2079] 2024-07-16 21:59:20,044 >> Num examples = 22,500 [INFO|trainer.py:2080] 2024-07-16 21:59:20,044 >> Num Epochs = 6 [INFO|trainer.py:2081] 2024-07-16 21:59:20,044 >> Instantaneous batch size per device = 16 [INFO|trainer.py:2084] 2024-07-16 21:59:20,044 >> Total train batch size (w. parallel, distributed & accumulation) = 128 [INFO|trainer.py:2085] 2024-07-16 21:59:20,044 >> Gradient Accumulation steps = 8 [INFO|trainer.py:2086] 2024-07-16 21:59:20,044 >> Total optimization steps = 1,050 [INFO|trainer.py:2087] 2024-07-16 21:59:20,048 >> Number of trainable parameters = 20,971,520 [INFO|trainer.py:2109] 2024-07-16 21:59:20,049 >> Continuing training from checkpoint, will skip to saved global_step [INFO|trainer.py:2110] 2024-07-16 21:59:20,049 >> Continuing training from epoch 3 [INFO|trainer.py:2111] 2024-07-16 21:59:20,049 >> Continuing training from global step 525 [INFO|trainer.py:2113] 2024-07-16 21:59:20,049 >> Will skip the first 3 epochs then the first 0 batches in the first epoch. [INFO|integration_utils.py:723] 2024-07-16 21:59:20,053 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: inflaton-sg (inflaton-ai). Use `wandb login --relogin` to force relogin wandb: Tracking run with wandb version 0.17.4 wandb: Run data is saved locally in /common2/dh.huang.2023/code/logical-reasoning/llama-factory/wandb/run-20240716_215921-b1vc6sbw wandb: Run `wandb offline` to turn off syncing. 
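The trainable-parameter count in the log above can be reproduced from the model dimensions in the config dump (hidden size 4096, intermediate size 14336, 32 layers, 8 KV heads). The YAML does not set a LoRA rank, so the sketch below assumes LLaMA-Factory's default lora_rank = 8; under that assumption the arithmetic lands exactly on the logged figures (20,971,520 trainable parameters, 0.2605% of 8,051,232,768):

```python
# Back-of-the-envelope check of "trainable params: 20,971,520 || all params:
# 8,051,232,768 || trainable%: 0.2605", assuming lora_rank = 8 (LLaMA-Factory's
# default; the YAML above does not override it).
hidden, intermediate, n_layers = 4096, 14336, 32
n_heads, n_kv_heads = 32, 8
head_dim = hidden // n_heads            # 128
kv_dim = n_kv_heads * head_dim          # 1024 (grouped-query attention)
rank = 8

# LoRA adds rank * (in_features + out_features) parameters per adapted linear.
adapted = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}
per_layer = sum(rank * (i + o) for i, o in adapted.values())  # 655,360
trainable = n_layers * per_layer                              # 20,971,520
print(trainable, f"{trainable / 8_051_232_768:.4%}")          # 20971520 0.2605%
```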
wandb: Syncing run llama3_8b_p2_full wandb: ⭐️ View project at https://wandb.ai/inflaton-ai/huggingface wandb: 🚀 View run at https://wandb.ai/inflaton-ai/huggingface/runs/b1vc6sbw 0%| | 0/1050 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:3721] 2024-07-17 00:36:20,827 >> Num examples = 2500 [INFO|trainer.py:3724] 2024-07-17 00:36:20,827 >> Batch size = 1 {'loss': 0.0974, 'grad_norm': 0.8705387711524963, 'learning_rate': 5.786276438761927e-05, 'epoch': 3.03} {'loss': 0.1113, 'grad_norm': 1.8239766359329224, 'learning_rate': 5.621718523237427e-05, 'epoch': 3.09} {'loss': 0.1229, 'grad_norm': 1.0920981168746948, 'learning_rate': 5.456473555193242e-05, 'epoch': 3.14} {'loss': 0.1398, 'grad_norm': 0.7232789993286133, 'learning_rate': 5.290724144552379e-05, 'epoch': 3.2} {'loss': 0.1621, 'grad_norm': 0.825590193271637, 'learning_rate': 5.124653458690365e-05, 'epoch': 3.26} {'loss': 0.1444, 'grad_norm': 1.0252315998077393, 'learning_rate': 4.9584450200195156e-05, 'epoch': 3.31} {'loss': 0.1548, 'grad_norm': 1.245003581047058, 'learning_rate': 4.792282503180867e-05, 'epoch': 3.37} {'loss': 0.1479, 'grad_norm': 1.084923267364502, 'learning_rate': 4.626349532067879e-05, 'epoch': 3.43} {'loss': 0.1443, 'grad_norm': 1.438701868057251, 'learning_rate': 4.4608294769062075e-05, 'epoch': 3.48} {'loss': 0.1482, 'grad_norm': 1.642706036567688, 'learning_rate': 4.295905251613817e-05, 'epoch': 3.54} {'loss': 0.1474, 'grad_norm': 0.8712331056594849, 'learning_rate': 4.131759111665349e-05, 'epoch': 3.6} {'loss': 0.1431, 'grad_norm': 1.0291883945465088, 'learning_rate': 3.968572452684113e-05, 'epoch': 3.65} {'loss': 0.1325, 'grad_norm': 0.9933144450187683, 'learning_rate': 3.806525609984312e-05, 'epoch': 3.71} {'loss': 0.1558, 'grad_norm': 0.9645794630050659, 'learning_rate': 3.6457976592849754e-05, 'epoch': 3.77} {'loss': 0.1589, 'grad_norm': 1.2767914533615112, 'learning_rate': 3.486566218815871e-05, 'epoch': 3.82} {'loss': 0.1463, 'grad_norm': 1.3329861164093018, 'learning_rate': 3.329007253034063e-05, 'epoch': 3.88} {'loss': 0.1523, 'grad_norm': 1.0121214389801025, 'learning_rate': 3.173294878168025e-05, 'epoch': 3.94} {'loss': 0.1498, 'grad_norm': 1.350295901298523, 'learning_rate': 3.019601169804216e-05, 'epoch': 4.0} 0%| | 0/2500 [00:00> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-700 /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-17 00:40:27,886 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-17 00:40:27,887 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "vocab_size": 128256 }
[INFO|tokenization_utils_base.py:2513] 2024-07-17 00:40:28,093 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-700/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-17 00:40:28,094 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-700/special_tokens_map.json
67%|██████▋ | 701/1050 [2:41:47<12:15:34, 126.46s/it] … 70%|███████ | 740/1050
[3:16:48<4:38:42, 53.94s/it] … 82%|████████▏ |
 82%|████████▏ | 861/1050 [5:05:23<2:49:14, 53.73s/it] ...  83%|████████▎ | 875/1050 [5:17:51<2:34:52, 53.10s/it]
[INFO|trainer.py:3719] 2024-07-17 03:17:20,557 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-17 03:17:20,557 >> Num examples = 2500
[INFO|trainer.py:3724] 2024-07-17 03:17:20,557 >> Batch size = 1
{'eval_loss': 0.26931771636009216, 'eval_accuracy': 0.9072, 'eval_runtime': 245.5232, 'eval_samples_per_second': 10.182, 'eval_steps_per_second': 10.182, 'epoch': 4.0}
{'loss': 0.1035, 'grad_norm': 1.4912649393081665, 'learning_rate': 2.8680959727287317e-05, 'epoch': 4.05}
{'loss': 0.1001, 'grad_norm': 1.1092299222946167, 'learning_rate': 2.718946713234185e-05, 'epoch': 4.11}
{'loss': 0.095, 'grad_norm': 0.7442717552185059, 'learning_rate': 2.5723182140992387e-05, 'epoch': 4.17}
{'loss': 0.0811, 'grad_norm': 0.8044183254241943, 'learning_rate': 2.428372512445233e-05, 'epoch': 4.22}
{'loss': 0.0838, 'grad_norm': 1.1196448802947998, 'learning_rate': 2.2872686806712035e-05, 'epoch': 4.28}
{'loss': 0.0949, 'grad_norm': 1.1083009243011475, 'learning_rate': 2.1491626506651914e-05, 'epoch': 4.34}
{'loss': 0.095, 'grad_norm': 1.3349604606628418, 'learning_rate': 2.0142070414860704e-05, 'epoch': 4.39}
{'loss': 0.0852, 'grad_norm': 0.8541048765182495, 'learning_rate': 1.8825509907063327e-05, 'epoch': 4.45}
{'loss': 0.0886, 'grad_norm': 0.9618383646011353, 'learning_rate': 1.7543399896022405e-05, 'epoch': 4.51}
{'loss': 0.097, 'grad_norm': 0.684533417224884, 'learning_rate': 1.629715722373423e-05, 'epoch': 4.56}
{'loss': 0.1029, 'grad_norm': 1.2351458072662354, 'learning_rate': 1.5088159095696363e-05, 'epoch': 4.62}
{'loss': 0.1039, 'grad_norm': 1.2963201999664307, 'learning_rate': 1.3917741558976894e-05, 'epoch': 4.68}
{'loss': 0.0964, 'grad_norm': 0.7727164030075073, 'learning_rate': 1.2787198025767416e-05, 'epoch': 4.73}
{'loss': 0.0828, 'grad_norm': 1.4305059909820557, 'learning_rate': 1.1697777844051105e-05, 'epoch': 4.79}
{'loss': 0.0982, 'grad_norm': 1.1239262819290161, 'learning_rate': 1.0650684916965559e-05, 'epoch': 4.85}
{'loss': 0.08, 'grad_norm': 0.9413287043571472, 'learning_rate': 9.647076372386194e-06, 'epoch': 4.9}
{'loss': 0.0825, 'grad_norm': 1.1029815673828125, 'learning_rate': 8.688061284200266e-06, 'epoch': 4.96}
  0%|          | 0/2500 [00:00
>> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-875
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
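At this point an intermediate LoRA adapter has been written to checkpoint-875. A minimal sketch of loading it for a quick smoke test, assuming the standard transformers + peft APIs (the prompt below is a placeholder, not drawn from the training data):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "shenzhi-wang/Llama3-8B-Chinese-Chat"
adapter_dir = "saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-875"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)  # attach the intermediate LoRA adapter

# Hypothetical prompt, only to confirm the checkpoint generates sensibly.
messages = [{"role": "user", "content": "你好"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```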
[INFO|configuration_utils.py:733] 2024-07-17 03:21:27,144 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-17 03:21:27,144 >> Model config LlamaConfig (identical to the config block above)
[INFO|tokenization_utils_base.py:2513] 2024-07-17 03:21:27,324 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-875/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-17 03:21:27,325 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-875/special_tokens_map.json
 83%|████████▎ | 876/1050 [5:22:51<6:08:45, 127.16s/it] ...  87%|████████▋ | 912/1050 [5:55:04<2:03:42, 53.78s/it]
 87%|████████▋ | 913/1050 [5:55:58<2:02:55, 53.84s/it] ...  98%|█████████▊| 1033/1050 [7:43:28<15:15, 53.87s/it]
 98%|█████████▊| 1034/1050 [7:44:21<14:18, 53.66s/it] ... 100%|██████████| 1050/1050 [7:58:41<00:00, 53.52s/it]
[INFO|trainer.py:3719] 2024-07-17 05:58:11,043 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-17 05:58:11,043 >> Num examples = 2500
[INFO|trainer.py:3724] 2024-07-17 05:58:11,043 >> Batch size = 1
{'eval_loss': 0.32899829745292664, 'eval_accuracy': 0.9088, 'eval_runtime': 245.605, 'eval_samples_per_second': 10.179, 'eval_steps_per_second': 10.179, 'epoch': 4.99}
{'loss': 0.0743, 'grad_norm': 0.6363257765769958, 'learning_rate': 7.774699446684608e-06, 'epoch': 5.02}
{'loss': 0.0537, 'grad_norm': 0.5979920029640198, 'learning_rate': 6.908000203341802e-06, 'epoch': 5.08}
{'loss': 0.0585, 'grad_norm': 0.6441344618797302, 'learning_rate': 6.088921331488568e-06, 'epoch': 5.13}
{'loss': 0.0493, 'grad_norm': 0.7351667881011963, 'learning_rate': 5.318367983829392e-06, 'epoch': 5.19}
{'loss': 0.0546, 'grad_norm': 1.1642309427261353, 'learning_rate': 4.597191688184754e-06, 'epoch': 5.25}
{'loss': 0.0576, 'grad_norm': 0.722756028175354, 'learning_rate': 3.9261894064796135e-06, 'epoch': 5.3}
{'loss': 0.0578, 'grad_norm': 0.6322811841964722, 'learning_rate': 3.306102654031823e-06, 'epoch': 5.36}
{'loss': 0.0537, 'grad_norm': 0.5810645818710327, 'learning_rate': 2.737616680113758e-06, 'epoch': 5.42}
{'loss': 0.0505, 'grad_norm': 0.933347225189209, 'learning_rate': 2.221359710692961e-06, 'epoch': 5.47}
{'loss': 0.059, 'grad_norm': 0.9870530962944031, 'learning_rate': 1.757902254188254e-06, 'epoch': 5.53}
{'loss': 0.054, 'grad_norm': 0.7121208906173706, 'learning_rate': 1.3477564710088098e-06, 'epoch': 5.59}
{'loss': 0.058, 'grad_norm': 1.0721352100372314, 'learning_rate': 9.913756075728087e-07, 'epoch': 5.64}
{'loss': 0.0494, 'grad_norm': 0.7931228280067444, 'learning_rate': 6.891534954310885e-07, 'epoch': 5.7}
{'loss': 0.0411, 'grad_norm': 0.6202746629714966, 'learning_rate': 4.4142411604936597e-07, 'epoch': 5.76}
{'loss': 0.0458, 'grad_norm': 0.6872019171714783, 'learning_rate': 2.4846123172992954e-07, 'epoch': 5.81}
{'loss': 0.0637, 'grad_norm': 0.7417372465133667, 'learning_rate': 1.1047808308075058e-07, 'epoch': 5.87}
{'loss': 0.0523, 'grad_norm': 0.6148027777671814, 'learning_rate': 2.7627153366222013e-08, 'epoch': 5.93}
{'loss': 0.0514, 'grad_norm': 0.6213644742965698, 'learning_rate': 0.0, 'epoch': 5.99}
  0%|          | 0/2500 [00:00
>> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-1050
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
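The learning rates logged above decay to 0.0 at the final step, consistent with a cosine schedule with warmup. A minimal sketch that reproduces the logged values; the peak learning rate (1e-4) and warmup length (105 of 1050 steps) are assumptions chosen to match the numbers in this log, not read from the trainer:

```python
import math

def cosine_lr(step, peak_lr=1e-4, warmup_steps=105, total_steps=1050):
    """Assumed cosine-with-warmup schedule (half-cosine decay after linear warmup)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(710))   # ~2.87e-05, close to the logged 2.868e-05 at epoch 4.05
print(cosine_lr(870))   # ~8.69e-06, close to the logged 8.688e-06 at epoch 4.96
print(cosine_lr(1050))  # 0.0, matching the final logged learning_rate
```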
[INFO|configuration_utils.py:733] 2024-07-17 06:02:17,272 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-17 06:02:17,273 >> Model config LlamaConfig (identical to the config block above)
[INFO|tokenization_utils_base.py:2513] 2024-07-17 06:02:17,455 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-1050/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-17 06:02:17,457 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p2_full/checkpoint-1050/special_tokens_map.json
[INFO|trainer.py:2329] 2024-07-17 06:02:17,944 >> Training completed. Do not forget to share your model on huggingface.co/models =)
100%|██████████| 1050/1050 [8:02:48<00:00, 53.52s/it]
100%|██████████| 1050/1050 [8:02:48<00:00, 27.59s/it]
[INFO|trainer.py:3410] 2024-07-17 06:02:17,949 >> Saving model checkpoint to saves/llama3-8b/lora/sft_bf16_p2_full
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
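Training is complete and the final LoRA adapter now lives in saves/llama3-8b/lora/sft_bf16_p2_full. A minimal sketch of merging that adapter into the base weights for standalone use, assuming the peft merge_and_unload API; the output directory name is hypothetical:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "shenzhi-wang/Llama3-8B-Chinese-Chat"
adapter_dir = "saves/llama3-8b/lora/sft_bf16_p2_full"

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights

# "merged-llama3-8b-mgtv" is a hypothetical output directory.
merged.save_pretrained("merged-llama3-8b-mgtv")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-llama3-8b-mgtv")
```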
[INFO|configuration_utils.py:733] 2024-07-17 06:02:18,851 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-17 06:02:18,851 >> Model config LlamaConfig (identical to the config block above)
[INFO|tokenization_utils_base.py:2513] 2024-07-17 06:02:19,033 >> tokenizer config file saved in saves/llama3-8b/lora/sft_bf16_p2_full/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-17 06:02:19,034 >> Special tokens file saved in saves/llama3-8b/lora/sft_bf16_p2_full/special_tokens_map.json
[INFO|trainer.py:3719] 2024-07-17 06:02:19,449 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-17 06:02:19,449 >> Num examples = 2500
[INFO|trainer.py:3724] 2024-07-17 06:02:19,449 >> Batch size = 1
{'eval_loss': 0.4251042604446411, 'eval_accuracy': 0.9049666666666666, 'eval_runtime': 245.4316, 'eval_samples_per_second': 10.186, 'eval_steps_per_second': 10.186, 'epoch': 5.99}
{'train_runtime': 28977.896, 'train_samples_per_second': 4.659, 'train_steps_per_second': 0.036, 'train_loss': 0.04824516418434325, 'epoch': 5.99}
***** train metrics *****
  epoch                    =       5.9851
  total_flos               = 3499703917GF
  train_loss               =       0.0482
  train_runtime            =   8:02:57.89
  train_samples_per_second =        4.659
  train_steps_per_second   =        0.036
Figure saved at: saves/llama3-8b/lora/sft_bf16_p2_full/training_loss.png
Figure saved at: saves/llama3-8b/lora/sft_bf16_p2_full/training_eval_loss.png
Figure saved at: saves/llama3-8b/lora/sft_bf16_p2_full/training_eval_accuracy.png
  0%|          | 0/2500 [00:00
>> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.9049666666666666}]}
***** eval metrics *****
  epoch                   =     5.9851
  eval_accuracy           =      0.905
  eval_loss               =     0.4251
  eval_runtime            = 0:04:07.46
  eval_samples_per_second =     10.102
  eval_steps_per_second   =     10.102
wandb: | 0.059 MB of 0.059 MB uploaded
wandb:
wandb: Run history:
wandb:            eval/accuracy ▅█▁▁
wandb:                eval/loss ▁▄██
wandb:             eval/runtime ▁▂▁█
wandb:  eval/samples_per_second █▇█▁
wandb:    eval/steps_per_second █▇█▁
wandb:              train/epoch ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
wandb:        train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
wandb:          train/grad_norm ▃█▄▂▄▅▆▇▃▃▃▅▃▅▆▂▂▄▅▃▃▅▅▂▄▃▄▁▁▂▂▁▁▃▂▄▁▂▂▁
wandb:      train/learning_rate ███▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
wandb:               train/loss ▄▅▆█▇█▇▇▇▆██▇▇▅▄▃▃▄▄▄▅▅▄▄▃▃▂▂▁▂▂▂▂▂▂▁▁▂▂
wandb:
wandb: Run summary:
wandb:            eval/accuracy 0.90497
wandb:                eval/loss 0.4251
wandb:             eval/runtime 247.467
wandb:  eval/samples_per_second 10.102
wandb:    eval/steps_per_second 10.102
wandb:               total_flos 3.757778467594961e+18
wandb:              train/epoch 5.98507
wandb:        train/global_step 1050
wandb:          train/grad_norm 0.62136
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.0514
wandb:               train_loss 0.04825
wandb:            train_runtime 28977.896
wandb: train_samples_per_second 4.659
wandb:   train_steps_per_second 0.036
wandb:
wandb: 🚀 View run llama3_8b_p2_full at: https://wandb.ai/inflaton-ai/huggingface/runs/b1vc6sbw
wandb: ⭐️ View project at: https://wandb.ai/inflaton-ai/huggingface
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240716_215921-b1vc6sbw/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Current Directory: /common/home/users/d/dh.huang.2023/code/logical-reasoning
Wed Jul 17 06:06:35 2024
NVIDIA-SMI 550.90.07 | Driver 550.90.07 | CUDA 12.4 | GPU 0: NVIDIA L40, 61C, P0, 96W / 300W, 1MiB / 46068MiB, 5% util | No running processes found
Linux lexicon 4.18.0-553.5.1.el8_10.x86_64 | Rocky Linux 8.10 (Green Obsidian) | AMD EPYC 7763 64-Core Processor, 128 CPUs | MemTotal: 527669560 kB
Eval shenzhi-wang/Llama3-8B-Chinese-Chat with llama-factory/saves/llama3-8b/lora/sft_bf16_p1_full
[INFO|tokenization_utils_base.py:2108] 2024-07-17 06:06:46,704 >> loading file tokenizer.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-07-17 06:06:46,704 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-07-17 06:06:46,704 >> loading file special_tokens_map.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-07-17 06:06:46,704 >> loading file tokenizer_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/tokenizer_config.json
[WARNING|logging.py:314] 2024-07-17 06:06:47,238 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-17 06:06:47,472 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/config.json
[INFO|configuration_utils.py:796] 2024-07-17 06:06:47,472 >> Model config LlamaConfig (as above, plus "_name_or_path": "shenzhi-wang/Llama3-8B-Chinese-Chat")
[INFO|modeling_utils.py:3474] 2024-07-17 06:06:47,556 >> loading weights file model.safetensors from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-07-17 06:06:47,564 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:962] 2024-07-17 06:06:47,565 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009
}
Loading checkpoint shards:   0%|          | 0/4 [00:00
>> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4288] 2024-07-17 06:07:05,768 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at shenzhi-wang/Llama3-8B-Chinese-Chat. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:917] 2024-07-17 06:07:06,019 >> loading configuration file generation_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--shenzhi-wang--Llama3-8B-Chinese-Chat/snapshots/f25f13cb2571e70e285121faceac92926b51e6f5/generation_config.json
[INFO|configuration_utils.py:962] 2024-07-17 06:07:06,019 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "pad_token_id": 128009
}
srun: Job step aborted: Waiting up to 17 seconds for job step to finish.
  0%|          | 0/3000 [00:00
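The aborted job step was about to run generation over a 3,000-example test set using the sft_bf16_p1_full adapter. A minimal sketch of what such an evaluation loop could look like, assuming standard transformers + peft APIs; the test-file path, prompt field name, output file, and generation settings are assumptions for illustration, not the project's actual eval script:

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from tqdm import tqdm

base_id = "shenzhi-wang/Llama3-8B-Chinese-Chat"
adapter_dir = "llama-factory/saves/llama3-8b/lora/sft_bf16_p1_full"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_dir)
model.eval()

# "data/mgtv_test.json" and the "prompt" field are hypothetical placeholders for the
# 3,000-example test set the final progress bar refers to.
with open("data/mgtv_test.json", encoding="utf-8") as f:
    examples = json.load(f)

predictions = []
for ex in tqdm(examples):
    messages = [{"role": "user", "content": ex["prompt"]}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=16, do_sample=False)
    predictions.append(
        tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
    )

with open("predictions.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(predictions))
```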