Submitting job: /common/home/users/d/dh.huang.2023/code/rapget-translation/scripts/tune-mac-4gpu.sh
Current Directory: /common/home/users/d/dh.huang.2023/code/rapget-translation
Tue Aug 13 09:24:39 2024

NVIDIA-SMI 550.90.07    Driver Version: 550.90.07    CUDA Version: 12.4
GPU 0: NVIDIA H100 PCIe | Persistence-M: On | Bus-Id: 00000000:01:00.0 | Disp.A: Off | Volatile Uncorr. ECC: 0
       Fan: N/A | Temp: 36C | Perf: P0 | Pwr:Usage/Cap: 49W / 350W | Memory-Usage: 1MiB / 81559MiB | GPU-Util: 0% | Compute M.: Default | MIG M.: Disabled
Processes: No running processes found

Linux holiday 4.18.0-553.5.1.el8_10.x86_64 #1 SMP Thu Jun 6 09:41:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

NAME="Rocky Linux"
VERSION="8.10 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    64
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            25
Model:                 17
Model name:            AMD EPYC 9554 64-Core Processor
Stepping:              1
CPU MHz:               3100.000
CPU max MHz:           3762.9880
CPU min MHz:           1500.0000
BogoMIPS:              6190.80
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              32768K
NUMA node0 CPU(s):     0-127
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
MemTotal: 527521124 kB
Current Directory: /common/home/users/d/dh.huang.2023/code/rapget-translation/llama-factory
[nltk_data] Downloading package wordnet to /common/home/users/d/dh.huang.2023/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /common/home/users/d/dh.huang.2023/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /common/home/users/d/dh.huang.2023/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] (the same wordnet/punkt/omw-1.4 download messages are printed a second time, verbatim)
loading env vars from: /common/home/users/d/dh.huang.2023/common2/code/rapget-translation/.env
Adding /common/home/users/d/dh.huang.2023/common2/code/rapget-translation to sys.path
loading: /common/home/users/d/dh.huang.2023/common2/code/rapget-translation/eval_modules/calc_repetitions.py
loading /common/home/users/d/dh.huang.2023/common2/code/rapget-translation/llm_toolkit/translation_utils.py
Qwen Qwen2-72B-Instruct qwen config/mac_template_4gpu.yaml ../datasets/mac/mac.tsv
Writing to config/models/Qwen2-72B-Instruct.yaml
config/models/Qwen2-72B-Instruct.yaml:
{
  "model_name_or_path": "Qwen/Qwen2-72B-Instruct",
  "quantization_bit": 4,
  "stage": "sft",
  "do_train": true,
  "finetuning_type": "lora",
  "lora_target": "all",
  "dataset": "alpaca_mac",
  "template": "qwen",
  "cutoff_len": 1024,
  "max_samples": 4528,
  "overwrite_cache": true,
  "preprocessing_num_workers": 16,
  "output_dir": "saves/Qwen2-72B-Instruct",
  "logging_steps": 5,
  "save_steps": 70,
  "plot_loss": true,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 8,
  "learning_rate": 0.0001,
  "num_train_epochs": 6.0,
  "lr_scheduler_type": "cosine",
  "warmup_ratio": 0.1,
  "bf16": true,
  "ddp_timeout": 180000000,
  "val_size": 0.01,
  "per_device_eval_batch_size": 1,
  "eval_strategy": "steps",
  "eval_steps": 70,
  "report_to": "wandb",
  "run_name": "Qwen2-72B-Instruct_lora_sft"
}
loading existing data from: data/alpaca_mac.json
--------------------------------------------------
system: You are a helpful assistant that translates Chinese to English.
--------------------------------------------------
instruction: You will be given a Chinese sentence to translate. If it is an incomplete sentence, or if you are unsure about the meaning, simply copy the input text as your output. Do not output any additional sentence such as explanation or reasoning.

Chinese: 全仗着狐仙搭救。
English:
--------------------------------------------------
input:
--------------------------------------------------
output: Because I was protected by a fox fairy.
--------------------------------------------------
system: You are a helpful assistant that translates Chinese to English.
--------------------------------------------------
instruction: You will be given a Chinese sentence to translate. If it is an incomplete sentence, or if you are unsure about the meaning, simply copy the input text as your output. Do not output any additional sentence such as explanation or reasoning.
Chinese: 上面说,这样写缺少细节。
English:
--------------------------------------------------
input:
--------------------------------------------------
output: This time the opinions from above said it needed more detail.
--------------------------------------------------
08/13/2024 09:25:14 - WARNING - llamafactory.hparams.parser - We recommend enable `upcast_layernorm` in quantized training.
08/13/2024 09:25:14 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2289] 2024-08-13 09:25:14,478 >> loading file vocab.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/vocab.json
[INFO|tokenization_utils_base.py:2289] 2024-08-13 09:25:14,478 >> loading file merges.txt from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/merges.txt
[INFO|tokenization_utils_base.py:2289] 2024-08-13 09:25:14,478 >> loading file tokenizer.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/tokenizer.json
[INFO|tokenization_utils_base.py:2289] 2024-08-13 09:25:14,478 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2289] 2024-08-13 09:25:14,478 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2289] 2024-08-13 09:25:14,478 >> loading file tokenizer_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2024-08-13 09:25:14,614 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/13/2024 09:25:14 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
08/13/2024 09:25:14 - INFO - llamafactory.data.loader - Loading dataset alpaca_mac.json...
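For orientation, here is a minimal sketch of how one of the translation pairs shown above could be stored as a record in data/alpaca_mac.json. The field names mirror the system / instruction / input / output fields printed by the script; the exact JSON layout and the helper below are illustrative assumptions, not code taken from the repository.

# alpaca_record_sketch.py - hypothetical helper, illustrating the record layout only
import json

SYSTEM = "You are a helpful assistant that translates Chinese to English."
INSTRUCTION = (
    "You will be given a Chinese sentence to translate. If it is an incomplete "
    "sentence, or if you are unsure about the meaning, simply copy the input text "
    "as your output. Do not output any additional sentence such as explanation or "
    "reasoning.\n\nChinese: {zh}\nEnglish:"
)

def make_record(zh: str, en: str) -> dict:
    # one alpaca-style example: the full prompt goes into "instruction",
    # "input" stays empty, and the reference translation goes into "output"
    return {
        "system": SYSTEM,
        "instruction": INSTRUCTION.format(zh=zh),
        "input": "",
        "output": en,
    }

record = make_record("全仗着狐仙搭救。", "Because I was protected by a fox fairy.")
print(json.dumps(record, ensure_ascii=False, indent=2))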
Converting format of dataset (num_proc=16):   0%| 0/4528 [00:00<?, ? examples/s]  (remaining preprocessing progress output truncated in the captured log)
loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 09:25:17,826 >> Model config Qwen2Config {
  "_name_or_path": "Qwen/Qwen2-72B-Instruct",
  "architectures": [ "Qwen2ForCausalLM" ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 29568,
  "max_position_embeddings": 32768,
  "max_window_layers": 80,
  "model_type": "qwen2",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

training example:
input_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 429, 46918, 8453, 311, 6364, 13, 151645, 198, 151644, 872, 198, 2610, 686, 387, 2661, 264, 8453, 11652, 311, 14683, 13, 1416, 432, 374, 458, 32143, 11652, 11, 476, 421, 498, 525, 42903, 911, 279, 7290, 11, 4936, 2975, 279, 1946, 1467, 438, 697, 2550, 13, 3155, 537, 2550, 894, 5107, 11652, 1741, 438, 16148, 476, 32711, 382, 44923, 25, 34369, 101, 102895, 99164, 100324, 100717, 100095, 99509, 8997, 22574, 25, 151645, 198, 151644, 77091, 198, 17949, 358, 572, 2617, 553, 264, 38835, 44486, 13, 151645]
inputs:
<|im_start|>system
You are a helpful assistant that translates Chinese to English.<|im_end|>
<|im_start|>user
You will be given a Chinese sentence to translate. If it is an incomplete sentence, or if you are unsure about the meaning, simply copy the input text as your output. Do not output any additional sentence such as explanation or reasoning.

Chinese: 全仗着狐仙搭救。
English:<|im_end|>
<|im_start|>assistant
Because I was protected by a fox fairy.<|im_end|>
label_ids: [-100, -100, ... (84 prompt positions, all -100) ..., 17949, 358, 572, 2617, 553, 264, 38835, 44486, 13, 151645]
labels: Because I was protected by a fox fairy.<|im_end|>
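The label_ids above show the usual supervised fine-tuning masking: all 84 prompt positions carry -100 (ignored by the cross-entropy loss) and only the assistant reply plus the closing <|im_end|> token are trained on. Below is a minimal sketch of that masking, assuming a generic Hugging Face chat tokenizer; build_example is an illustrative helper, not LLaMA-Factory's implementation.

# mask_prompt_labels.py - illustrative sketch of the -100 masking visible in label_ids above
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_example(tokenizer, messages, response):
    # token ids for the prompt (system + user turns + assistant header)
    prompt_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True
    )
    # token ids for the assistant reply, closed with the eos token (<|im_end|> for Qwen2)
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids  # loss only on the reply
    return {"input_ids": input_ids, "labels": labels}

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")
    ex = build_example(
        tok,
        [
            {"role": "system", "content": "You are a helpful assistant that translates Chinese to English."},
            {"role": "user", "content": "Chinese: 全仗着狐仙搭救。\nEnglish:"},
        ],
        "Because I was protected by a fox fairy.",
    )
    print(len(ex["input_ids"]), ex["labels"][:5], ex["labels"][-5:])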
08/13/2024 09:25:17 - INFO - llamafactory.model.model_utils.quantization - Quantizing model to 4 bit with bitsandbytes.
[INFO|modeling_utils.py:3634] 2024-08-13 09:25:18,127 >> loading weights file model.safetensors from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-13 09:25:18,141 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-13 09:25:18,142 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}
Loading checkpoint shards:   0%| 0/37 [00:00<?, ?it/s]  (per-shard progress output truncated in the captured log)
All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4471] 2024-08-13 09:27:40,620 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at Qwen/Qwen2-72B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:993] 2024-08-13 09:27:40,880 >> loading configuration file generation_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-13 09:27:40,880 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [ 151645, 151643 ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}
08/13/2024 09:36:17 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/13/2024 09:36:17 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
08/13/2024 09:36:17 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
08/13/2024 09:36:17 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/13/2024 09:36:17 - INFO - llamafactory.model.model_utils.misc - Found linear modules: k_proj,up_proj,gate_proj,v_proj,q_proj,down_proj,o_proj
08/13/2024 09:36:18 - INFO - llamafactory.model.loader - trainable params: 105,267,200 || all params: 72,811,470,848 || trainable%: 0.1446
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:648] 2024-08-13 09:36:18,951 >> Using auto half precision backend
[INFO|trainer.py:2134] 2024-08-13 09:36:20,064 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-08-13 09:36:20,064 >>   Num examples = 4,482
[INFO|trainer.py:2136] 2024-08-13 09:36:20,064 >>   Num Epochs = 6
[INFO|trainer.py:2137] 2024-08-13 09:36:20,064 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-08-13 09:36:20,064 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2141] 2024-08-13 09:36:20,064 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-08-13 09:36:20,064 >>   Total optimization steps = 1,680
[INFO|trainer.py:2143] 2024-08-13 09:36:20,072 >>   Number of trainable parameters = 105,267,200
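The counts reported above are internally consistent; a quick back-of-the-envelope check follows. The LoRA rank of 8 is an assumption (it is LLaMA-Factory's default and is not printed in this log); all other numbers come from the run configuration and the Qwen2Config.

# sanity_check_numbers.py - reproduces the step and parameter counts reported above
import math

# optimization steps: 4,482 training examples, per-device batch 2, grad accumulation 8, 6 epochs
num_examples, micro_bs, grad_accum, epochs = 4482, 2, 8, 6
batches_per_epoch = math.ceil(num_examples / micro_bs)   # 2241 micro-batches per epoch
steps_per_epoch = batches_per_epoch // grad_accum        # 280 optimizer steps per epoch
print(steps_per_epoch * epochs)                          # 1680, the reported total optimization steps
print(micro_bs * grad_accum)                             # 16, the reported total train batch size

# trainable parameters: LoRA adapters (rank r assumed = 8) on the 7 linear projections
# of each of the 80 Qwen2-72B layers (hidden 8192, intermediate 29568, kv dim 8*128 = 1024)
r, hidden, inter, kv = 8, 8192, 29568, 1024
per_layer = sum(r * (i + o) for i, o in [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, inter),   # gate_proj
    (hidden, inter),   # up_proj
    (inter, hidden),   # down_proj
])
trainable = per_layer * 80
print(trainable)                                         # 105,267,200
print(round(100 * trainable / 72_811_470_848, 4))        # 0.1446 (%)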
[INFO|integration_utils.py:807] 2024-08-13 09:36:20,079 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: inflaton-sg (inflaton-ai). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.6
wandb: Run data is saved locally in /common/home/users/d/dh.huang.2023/common2/code/rapget-translation/llama-factory/wandb/run-20240813_093621-8lhczcch
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run Qwen2-72B-Instruct_lora_sft
wandb: ⭐️ View project at https://wandb.ai/inflaton-ai/huggingface
wandb: 🚀 View run at https://wandb.ai/inflaton-ai/huggingface/runs/8lhczcch

  0%| 0/1680 [00:00<?, ?it/s]  (training progress output for steps 1-70 truncated in the captured log)
[INFO|trainer.py:3819] 2024-08-13 09:50:20,638 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 09:50:20,638 >>   Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 09:50:20,638 >>   Batch size = 1
{'loss': 2.7425, 'grad_norm': 1.8217403888702393, 'learning_rate': 2.9761904761904763e-06, 'epoch': 0.02}
{'loss': 2.861, 'grad_norm': 2.104698419570923, 'learning_rate': 5.9523809523809525e-06, 'epoch': 0.04}
{'loss': 2.8281, 'grad_norm': 2.7389333248138428, 'learning_rate': 8.92857142857143e-06, 'epoch': 0.05}
{'loss': 3.1888, 'grad_norm': 3.9298207759857178, 'learning_rate': 1.1904761904761905e-05, 'epoch': 0.07}
{'loss': 2.6461, 'grad_norm': 2.648014783859253, 'learning_rate': 1.4880952380952381e-05, 'epoch': 0.09}
{'loss': 2.3212, 'grad_norm': 1.587472915649414, 'learning_rate': 1.785714285714286e-05, 'epoch': 0.11}
{'loss': 1.8036, 'grad_norm': 0.8390935063362122, 'learning_rate': 2.0833333333333336e-05, 'epoch': 0.12}
{'loss': 1.5552, 'grad_norm': 0.46670979261398315, 'learning_rate': 2.380952380952381e-05, 'epoch': 0.14}
{'loss': 1.6626, 'grad_norm': 0.45171597599983215, 'learning_rate': 2.6785714285714288e-05, 'epoch': 0.16}
{'loss': 1.4897, 'grad_norm': 0.5605499744415283, 'learning_rate': 2.9761904761904762e-05, 'epoch': 0.18}
{'loss': 1.5373, 'grad_norm': 0.5553259253501892, 'learning_rate': 3.273809523809524e-05, 'epoch': 0.2}
{'loss': 1.4779, 'grad_norm': 0.6260251402854919, 'learning_rate': 3.571428571428572e-05, 'epoch': 0.21}
{'loss': 1.483, 'grad_norm': 0.6063796877861023, 'learning_rate': 3.8690476190476195e-05, 'epoch': 0.23}
{'loss': 1.5022, 'grad_norm': 0.5549850463867188, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.25}
  0%| 0/46 [00:00<?, ?it/s]  (evaluation progress output truncated in the captured log)
Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-70
[INFO|configuration_utils.py:733] 2024-08-13 09:50:38,962 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 09:50:38,963 >> Model config Qwen2Config { ... }  (identical to the Qwen2Config printed above)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 09:50:39,206 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-70/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 09:50:39,207 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-70/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
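The learning_rate values in the log entries above follow the configured schedule: warmup_ratio 0.1 over 1,680 steps gives 168 linear warmup steps toward the peak of 1e-4. A small check, assuming the standard linear-warmup scheduler (the cosine decay that takes over after step 168 is checked further below):

# lr_warmup_check.py - reproduces the learning_rate values logged during warmup
peak_lr, total_steps, warmup_ratio = 1e-4, 1680, 0.1
warmup_steps = int(total_steps * warmup_ratio)   # 168

def lr_at(step: int) -> float:
    # linear warmup portion only
    return peak_lr * step / warmup_steps

for step in (5, 10, 15, 70):
    print(step, lr_at(step))
# 5  -> 2.976e-06   (matches the first logged value, epoch 0.02)
# 10 -> 5.952e-06   (epoch 0.04)
# 15 -> 8.929e-06   (epoch 0.05)
# 70 -> 4.167e-05   (matches the value logged at epoch 0.25)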
[training progress: steps 71-140 of 1680, elapsed 14:23 → 28:02, per-step time ≈ 11.4-17.9 s (typically ≈ 12 s); individual tqdm updates condensed]
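The FutureWarning printed around each checkpoint comes from torch.utils.checkpoint inside the installed PyTorch and does not affect this run. For code under one's own control, the migration the warning asks for is simply the following (illustrative usage only, not a patch to the installed library):

import torch

# deprecated spelling that triggers the FutureWarning seen in this log
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    ...

# replacement suggested by the warning
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    ...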
[INFO|trainer.py:3819] 2024-08-13 10:04:30,555 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 10:04:30,555 >>   Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 10:04:30,555 >>   Batch size = 1
{'eval_loss': 1.451762318611145, 'eval_runtime': 17.7549, 'eval_samples_per_second': 2.591, 'eval_steps_per_second': 2.591, 'epoch': 0.25}
{'loss': 1.4256, 'grad_norm': 0.482930988073349, 'learning_rate': 4.464285714285715e-05, 'epoch': 0.27}
{'loss': 1.3655, 'grad_norm': 0.4240593910217285, 'learning_rate': 4.761904761904762e-05, 'epoch': 0.29}
{'loss': 1.4478, 'grad_norm': 0.4872314929962158, 'learning_rate': 5.05952380952381e-05, 'epoch': 0.3}
{'loss': 1.3305, 'grad_norm': 0.42132768034935, 'learning_rate': 5.3571428571428575e-05, 'epoch': 0.32}
{'loss': 1.4279, 'grad_norm': 0.6932046413421631, 'learning_rate': 5.6547619047619046e-05, 'epoch': 0.34}
{'loss': 1.4967, 'grad_norm': 0.6714524626731873, 'learning_rate': 5.9523809523809524e-05, 'epoch': 0.36}
{'loss': 1.4739, 'grad_norm': 0.5682816505432129, 'learning_rate': 6.25e-05, 'epoch': 0.37}
{'loss': 1.3751, 'grad_norm': 0.7795937657356262, 'learning_rate': 6.547619047619048e-05, 'epoch': 0.39}
{'loss': 1.3699, 'grad_norm': 0.8056842088699341, 'learning_rate': 6.845238095238096e-05, 'epoch': 0.41}
{'loss': 1.4696, 'grad_norm': 0.8373801112174988, 'learning_rate': 7.142857142857143e-05, 'epoch': 0.43}
{'loss': 1.4059, 'grad_norm': 1.0051416158676147, 'learning_rate': 7.440476190476191e-05, 'epoch': 0.45}
{'loss': 1.3072, 'grad_norm': 0.5304180383682251, 'learning_rate': 7.738095238095239e-05, 'epoch': 0.46}
{'loss': 1.4132, 'grad_norm': 0.8797634243965149, 'learning_rate': 8.035714285714287e-05, 'epoch': 0.48}
{'loss': 1.4121, 'grad_norm': 0.9049625396728516, 'learning_rate': 8.333333333333334e-05, 'epoch': 0.5}
  0%| 0/46 [00:00<?, ?it/s]  (evaluation progress output truncated in the captured log)
Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-140
[INFO|configuration_utils.py:733] 2024-08-13 10:04:48,938 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 10:04:48,939 >> Model config Qwen2Config { ... }  (identical to the Qwen2Config printed above)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 10:04:49,179 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-140/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 10:04:49,179 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-140/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
[training progress: steps 141-210 of 1680, elapsed 28:33 → 42:16, per-step time ≈ 11.6-17.7 s (typically ≈ 12 s); individual tqdm updates condensed]
[INFO|trainer.py:3819] 2024-08-13 10:18:44,970 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 10:18:44,970 >>   Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 10:18:44,971 >>   Batch size = 1
{'eval_loss': 1.3727394342422485, 'eval_runtime': 17.745, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 0.5}
{'loss': 1.3109, 'grad_norm': 0.6793915033340454, 'learning_rate': 8.630952380952382e-05, 'epoch': 0.52}
{'loss': 1.3781, 'grad_norm': 0.7171015739440918, 'learning_rate': 8.92857142857143e-05, 'epoch': 0.54}
{'loss': 1.3564, 'grad_norm': 0.6738716959953308, 'learning_rate': 9.226190476190478e-05, 'epoch': 0.55}
{'loss': 1.2387, 'grad_norm': 0.699975311756134, 'learning_rate': 9.523809523809524e-05, 'epoch': 0.57}
{'loss': 1.3042, 'grad_norm': 0.7659904956817627, 'learning_rate': 9.821428571428572e-05, 'epoch': 0.59}
{'loss': 1.3709, 'grad_norm': 0.9782125353813171, 'learning_rate': 9.999956828659095e-05, 'epoch': 0.61}
{'loss': 1.3844, 'grad_norm': 1.0532957315444946, 'learning_rate': 9.999471159635539e-05, 'epoch': 0.62}
{'loss': 1.2852, 'grad_norm': 0.7373877167701721, 'learning_rate': 9.998445910004082e-05, 'epoch': 0.64}
{'loss': 1.4652, 'grad_norm': 1.0207768678665161, 'learning_rate': 9.996881190417393e-05, 'epoch': 0.66}
{'loss': 1.3743, 'grad_norm': 0.7943917512893677, 'learning_rate': 9.994777169751806e-05, 'epoch': 0.68}
{'loss': 1.2423, 'grad_norm': 0.7461659908294678, 'learning_rate': 9.992134075089084e-05, 'epoch': 0.7}
{'loss': 1.3113, 'grad_norm': 0.9689913988113403, 'learning_rate': 9.988952191691925e-05, 'epoch': 0.71}
{'loss': 1.3524, 'grad_norm': 0.766276478767395, 'learning_rate': 9.985231862973168e-05, 'epoch': 0.73}
{'loss': 1.4038, 'grad_norm': 0.6728419661521912, 'learning_rate': 9.980973490458728e-05, 'epoch': 0.75}
  0%| 0/46 [00:00<?, ?it/s]  (evaluation progress output truncated in the captured log)
Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-210
[INFO|configuration_utils.py:733] 2024-08-13 10:19:03,320 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 10:19:03,321 >> Model config Qwen2Config { ... }  (identical to the Qwen2Config printed above)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 10:19:03,551 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-210/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 10:19:03,551 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-210/special_tokens_map.json
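From epoch 0.61 onward the logged learning_rate values sit on the cosine decay that follows the 168 warmup steps. A small check, assuming the standard half-cosine decay of the peak learning rate used by the `cosine` scheduler:

# lr_cosine_check.py - reproduces the post-warmup learning_rate values in the log
import math

peak_lr, total_steps, warmup_steps = 1e-4, 1680, 168

def lr_at(step: int) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (170, 175, 210):
    print(step, lr_at(step))
# 170 -> 9.99996e-05  (logged 9.999956828659095e-05 at epoch 0.61)
# 175 -> 9.99947e-05  (logged 9.999471159635539e-05 at epoch 0.62)
# 210 -> 9.98097e-05  (logged 9.980973490458728e-05 at epoch 0.75)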
(the torch.utils.checkpoint FutureWarning about `torch.cpu.amp.autocast`, from /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295, is printed again here)
[training progress: steps 211-280 of 1680, elapsed 42:48 → 56:31, per-step time ≈ 11.6-18.0 s; individual tqdm updates condensed]
[INFO|trainer.py:3819] 2024-08-13 10:32:59,813 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 10:32:59,814 >>   Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 10:32:59,814 >>   Batch size = 1
{'eval_loss': 1.3051044940948486, 'eval_runtime': 17.7559, 'eval_samples_per_second': 2.591, 'eval_steps_per_second': 2.591, 'epoch': 0.75}
{'loss': 1.3626, 'grad_norm': 1.0456575155258179, 'learning_rate': 9.976177533744261e-05, 'epoch': 0.77}
{'loss': 1.3232, 'grad_norm': 0.9017456769943237, 'learning_rate': 9.97084451044556e-05, 'epoch': 0.79}
{'loss': 1.2826, 'grad_norm': 0.9113703966140747, 'learning_rate': 9.964974996142698e-05, 'epoch': 0.8}
{'loss': 1.2794, 'grad_norm': 0.7177279591560364, 'learning_rate': 9.958569624317893e-05, 'epoch': 0.82}
{'loss': 1.3853, 'grad_norm': 0.9058728814125061, 'learning_rate': 9.951629086287151e-05, 'epoch': 0.84}
{'loss': 1.3533, 'grad_norm': 0.6813459992408752, 'learning_rate': 9.944154131125642e-05, 'epoch': 0.86}
{'loss': 1.3395, 'grad_norm': 0.7113555073738098, 'learning_rate': 9.936145565586871e-05, 'epoch': 0.87}
{'loss': 1.443, 'grad_norm': 1.243597149848938, 'learning_rate': 9.927604254015585e-05, 'epoch': 0.89}
{'loss': 1.398, 'grad_norm': 0.8651953339576721, 'learning_rate': 9.918531118254507e-05, 'epoch': 0.91}
{'loss': 1.346, 'grad_norm': 0.8877395987510681, 'learning_rate': 9.90892713754483e-05, 'epoch': 0.93}
{'loss': 1.3921, 'grad_norm': 0.8857008814811707, 'learning_rate': 9.898793348420536e-05, 'epoch': 0.95}
{'loss': 1.3838, 'grad_norm': 0.8319969177246094, 'learning_rate': 9.888130844596524e-05, 'epoch': 0.96}
{'loss': 1.3529, 'grad_norm': 0.7452044486999512, 'learning_rate': 9.876940776850569e-05, 'epoch': 0.98}
{'loss': 1.2739, 'grad_norm': 0.7535015940666199, 'learning_rate': 9.865224352899119e-05, 'epoch': 1.0}
  0%| 0/46 [00:00<?, ?it/s]  (evaluation progress output truncated in the captured log)
Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-280
[INFO|configuration_utils.py:733] 2024-08-13 10:33:18,197 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 10:33:18,197 >> Model config Qwen2Config { ... }  (identical to the Qwen2Config printed above)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 10:33:18,464 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-280/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 10:33:18,465 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-280/special_tokens_map.json
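The evaluation and checkpoint cadence, and the reported evaluation throughput, are consistent with the configuration: eval_steps = save_steps = 70, and val_size 0.01 of the 4,528 examples leaves 4,482 for training and 46 for evaluation. A quick check (numbers taken from the log above):

# eval_cadence_check.py - checkpoint schedule and eval throughput implied by the log
save_steps, total_steps = 70, 1680
checkpoints = list(range(save_steps, total_steps + 1, save_steps))
print(checkpoints[:4], "...", checkpoints[-1])   # [70, 140, 210, 280] ... 1680

# eval throughput at checkpoint-280: 46 examples in 17.7559 s, batch size 1
num_eval, runtime = 46, 17.7559
print(round(num_eval / runtime, 3))              # ≈ 2.591 samples/s, matching the log
print(round(runtime / num_eval, 3))              # ≈ 0.386 s per held-out sentence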
(the torch.utils.checkpoint FutureWarning about `torch.cpu.amp.autocast` is printed again here)
[training progress: steps 281-350 of 1680, elapsed 57:03 → 1:10:43, per-step time ≈ 11.5-18.3 s; individual tqdm updates condensed]
[INFO|trainer.py:3819] 2024-08-13 10:47:11,901 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 10:47:11,901 >>   Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 10:47:11,901 >>   Batch size = 1
{'eval_loss': 1.289029836654663, 'eval_runtime': 17.7491, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 1.0}
{'loss': 1.2339, 'grad_norm': 0.7779117226600647, 'learning_rate': 9.852982837266955e-05, 'epoch': 1.02}
{'loss': 1.0982, 'grad_norm': 0.8113610744476318, 'learning_rate': 9.840217551150706e-05, 'epoch': 1.04}
{'loss': 1.2537, 'grad_norm': 1.004701852798462, 'learning_rate': 9.826929872276255e-05, 'epoch': 1.05}
{'loss': 1.1664, 'grad_norm': 1.524734616279602, 'learning_rate': 9.81312123475006e-05, 'epoch': 1.07}
{'loss': 1.08, 'grad_norm': 1.5680856704711914, 'learning_rate': 9.798793128904356e-05, 'epoch': 1.09}
{'loss': 1.1029, 'grad_norm': 1.4838035106658936, 'learning_rate': 9.78394710113631e-05, 'epoch': 1.11}
{'loss': 1.1524, 'grad_norm': 1.522316575050354, 'learning_rate': 9.768584753741134e-05, 'epoch': 1.12}
{'loss': 1.1328, 'grad_norm': 1.3976528644561768, 'learning_rate': 9.752707744739145e-05, 'epoch': 1.14}
{'loss': 1.1174, 'grad_norm': 1.4764764308929443, 'learning_rate': 9.736317787696816e-05, 'epoch': 1.16}
{'loss': 1.0493, 'grad_norm': 1.3623173236846924, 'learning_rate': 9.719416651541839e-05, 'epoch': 1.18}
{'loss': 1.0479, 'grad_norm': 1.3625001907348633, 'learning_rate': 9.702006160372209e-05, 'epoch': 1.2}
{'loss': 1.1043, 'grad_norm': 1.7509726285934448, 'learning_rate': 9.684088193259355e-05, 'epoch': 1.21}
{'loss': 1.1096, 'grad_norm': 1.5920188426971436, 'learning_rate': 9.665664684045333e-05, 'epoch': 1.23}
{'loss': 1.1436, 'grad_norm': 1.6554943323135376, 'learning_rate': 9.646737621134112e-05, 'epoch': 1.25}
  0%| 0/46 [00:00<?, ?it/s]  (evaluation progress output truncated in the captured log)
Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-350
[INFO|configuration_utils.py:733] 2024-08-13 10:47:33,896 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 10:47:33,897 >> Model config Qwen2Config { ... }  (identical to the Qwen2Config printed above)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 10:47:34,133 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-350/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 10:47:34,133 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-350/special_tokens_map.json
"num_key_value_heads": 8, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.3", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 } [INFO|tokenization_utils_base.py:2702] 2024-08-13 10:47:34,133 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-350/tokenizer_config.json [INFO|tokenization_utils_base.py:2711] 2024-08-13 10:47:34,133 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-350/special_tokens_map.json /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined] 21%|██ | 351/1680 [1:11:18<6:58:31, 18.90s/it] 21%|██ | 352/1680 [1:11:30<6:10:39, 16.75s/it] 21%|██ | 353/1680 [1:11:43<5:42:59, 15.51s/it] 21%|██ | 354/1680 [1:11:54<5:16:34, 14.32s/it] 21%|██ | 355/1680 [1:12:05<4:55:13, 13.37s/it] 21%|██ | 355/1680 [1:12:05<4:55:13, 13.37s/it] 21%|██ | 356/1680 [1:12:17<4:45:40, 12.95s/it] 21%|██▏ | 357/1680 [1:12:28<4:33:48, 12.42s/it] 21%|██▏ | 358/1680 [1:12:40<4:26:44, 12.11s/it] 21%|██▏ | 359/1680 [1:12:52<4:29:47, 12.25s/it] 21%|██▏ | 360/1680 [1:13:05<4:31:40, 12.35s/it] 21%|██▏ | 360/1680 [1:13:05<4:31:40, 12.35s/it] 21%|██▏ | 361/1680 [1:13:17<4:29:49, 12.27s/it] 22%|██▏ | 362/1680 [1:13:29<4:24:08, 12.02s/it] 22%|██▏ | 363/1680 [1:13:41<4:24:51, 12.07s/it] 22%|██▏ | 364/1680 [1:13:53<4:25:17, 12.10s/it] 22%|██▏ | 365/1680 [1:14:05<4:25:56, 12.13s/it] 22%|██▏ | 365/1680 [1:14:05<4:25:56, 12.13s/it] 22%|██▏ | 366/1680 [1:14:16<4:20:50, 11.91s/it] 22%|██▏ | 367/1680 [1:14:28<4:19:27, 11.86s/it] 22%|██▏ | 368/1680 [1:14:41<4:22:27, 12.00s/it] 22%|██▏ | 369/1680 [1:14:53<4:26:40, 12.20s/it] 22%|██▏ | 370/1680 [1:15:05<4:23:25, 12.07s/it] 22%|██▏ | 370/1680 [1:15:05<4:23:25, 12.07s/it] 22%|██▏ | 371/1680 [1:15:17<4:20:32, 11.94s/it] 22%|██▏ | 372/1680 [1:15:29<4:22:30, 12.04s/it] 22%|██▏ | 373/1680 [1:15:40<4:16:50, 11.79s/it] 22%|██▏ | 374/1680 [1:15:52<4:15:21, 11.73s/it] 22%|██▏ | 375/1680 [1:16:05<4:25:24, 12.20s/it] 22%|██▏ | 375/1680 [1:16:05<4:25:24, 12.20s/it] 22%|██▏ | 376/1680 [1:16:17<4:21:20, 12.02s/it] 22%|██▏ | 377/1680 [1:16:29<4:20:31, 12.00s/it] 22%|██▎ | 378/1680 [1:16:40<4:17:14, 11.85s/it] 23%|██▎ | 379/1680 [1:16:52<4:15:40, 11.79s/it] 23%|██▎ | 380/1680 [1:17:04<4:17:18, 11.88s/it] 23%|██▎ | 380/1680 [1:17:04<4:17:18, 11.88s/it] 23%|██▎ | 381/1680 [1:17:15<4:15:04, 11.78s/it] 23%|██▎ | 382/1680 [1:17:26<4:10:25, 11.58s/it] 23%|██▎ | 383/1680 [1:17:39<4:15:31, 11.82s/it] 23%|██▎ | 384/1680 [1:17:51<4:16:30, 11.88s/it] 23%|██▎ | 385/1680 [1:18:03<4:20:06, 12.05s/it] 23%|██▎ | 385/1680 [1:18:03<4:20:06, 12.05s/it] 23%|██▎ | 386/1680 [1:18:15<4:19:13, 12.02s/it] 23%|██▎ | 387/1680 [1:18:28<4:21:55, 12.15s/it] 23%|██▎ | 388/1680 [1:18:39<4:16:42, 11.92s/it] 23%|██▎ | 389/1680 [1:18:52<4:23:54, 12.27s/it] 23%|██▎ | 390/1680 [1:19:03<4:14:00, 11.81s/it] 23%|██▎ | 390/1680 [1:19:03<4:14:00, 11.81s/it] 23%|██▎ | 391/1680 [1:19:16<4:19:00, 12.06s/it] 23%|██▎ | 392/1680 [1:19:27<4:17:20, 11.99s/it] 23%|██▎ | 393/1680 [1:19:39<4:18:04, 12.03s/it] 23%|██▎ | 394/1680 [1:19:51<4:16:11, 11.95s/it] 24%|██▎ | 395/1680 [1:20:03<4:14:05, 11.86s/it] 24%|██▎ | 395/1680 [1:20:03<4:14:05, 11.86s/it] 24%|██▎ | 396/1680 
[INFO|trainer.py:3819] 2024-08-13 11:01:33,733 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 11:01:33,733 >>   Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 11:01:33,733 >>   Batch size = 1
{'eval_loss': 1.3194608688354492, 'eval_runtime': 17.7382, 'eval_samples_per_second': 2.593, 'eval_steps_per_second': 2.593, 'epoch': 1.25}
{'loss': 1.0549, 'grad_norm': 1.881818175315857, 'learning_rate': 9.627309047276974e-05, 'epoch': 1.27}
{'loss': 1.1576, 'grad_norm': 1.8770464658737183, 'learning_rate': 9.607381059352038e-05, 'epoch': 1.29}
{'loss': 1.1246, 'grad_norm': 1.6901912689208984, 'learning_rate': 9.586955808137958e-05, 'epoch': 1.3}
{'loss': 1.125, 'grad_norm': 1.7667070627212524, 'learning_rate': 9.566035498081784e-05, 'epoch': 1.32}
{'loss': 1.1687, 'grad_norm': 1.6150933504104614, 'learning_rate': 9.544622387061055e-05, 'epoch': 1.34}
{'loss': 0.9699, 'grad_norm': 1.5824884176254272, 'learning_rate': 9.522718786140097e-05, 'epoch': 1.36}
{'loss': 1.1379, 'grad_norm': 1.5410280227661133, 'learning_rate': 9.500327059320606e-05, 'epoch': 1.37}
{'loss': 1.0511, 'grad_norm': 2.264235496520996, 'learning_rate': 9.477449623286505e-05, 'epoch': 1.39}
{'loss': 1.0003, 'grad_norm': 1.7440612316131592, 'learning_rate': 9.454088947143116e-05, 'epoch': 1.41}
{'loss': 1.1631, 'grad_norm': 1.770466923713684, 'learning_rate': 9.430247552150673e-05, 'epoch': 1.43}
{'loss': 1.045, 'grad_norm': 1.9537169933319092, 'learning_rate': 9.405928011452211e-05, 'epoch': 1.45}
{'loss': 1.0511, 'grad_norm': 1.452445387840271, 'learning_rate': 9.381132949795861e-05, 'epoch': 1.46}
{'loss': 1.1637, 'grad_norm': 2.176547050476074, 'learning_rate': 9.35586504325155e-05, 'epoch': 1.48}
{'loss': 1.0783, 'grad_norm': 2.15567684173584, 'learning_rate': 9.330127018922194e-05, 'epoch': 1.5}
  0%| 0/46 [00:00<?, ?it/s]  (evaluation progress output truncated in the captured log)
Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-420
[INFO|configuration_utils.py:733] 2024-08-13 11:01:52,102 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 11:01:52,103 >> Model config Qwen2Config { ... }  (identical to the Qwen2Config printed above)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 11:01:52,349 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-420/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 11:01:52,349 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-420/special_tokens_map.json
(the torch.utils.checkpoint FutureWarning about `torch.cpu.amp.autocast` is printed again here)
[training progress: steps 421-490 of 1680, elapsed 1:25:36 → 1:39:13, per-step time ≈ 11.3-17.7 s; individual tqdm updates condensed]
[tqdm progress bar output: steps 456-490 of 1680, elapsed 1:32:28 -> 1:39:13, mostly ~12 s/it]

[INFO|trainer.py:3819] 2024-08-13 11:15:41,364 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 11:15:41,364 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 11:15:41,364 >> Batch size = 1
{'eval_loss': 1.3106330633163452, 'eval_runtime': 17.7447, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 1.5}
{'loss': 1.0406, 'grad_norm': 1.6800014972686768, 'learning_rate': 9.303921654649362e-05, 'epoch': 1.52}
{'loss': 1.1469, 'grad_norm': 1.926607370376587, 'learning_rate': 9.277251778713474e-05, 'epoch': 1.54}
{'loss': 1.0453, 'grad_norm': 1.7155028581619263, 'learning_rate': 9.250120269528546e-05, 'epoch': 1.55}
{'loss': 1.0611, 'grad_norm': 1.9001247882843018, 'learning_rate': 9.22253005533154e-05, 'epoch': 1.57}
{'loss': 1.082, 'grad_norm': 2.2804248332977295, 'learning_rate': 9.194484113866313e-05, 'epoch': 1.59}
{'loss': 1.2404, 'grad_norm': 1.9318439960479736, 'learning_rate': 9.165985472062246e-05, 'epoch': 1.61}
{'loss': 1.0436, 'grad_norm': 1.6018136739730835, 'learning_rate': 9.137037205707552e-05, 'epoch': 1.62}
{'loss': 1.1227, 'grad_norm': 2.1986541748046875, 'learning_rate': 9.107642439117321e-05, 'epoch': 1.64}
{'loss': 1.0858, 'grad_norm': 1.5558295249938965, 'learning_rate': 9.077804344796302e-05, 'epoch': 1.66}
{'loss': 1.0998, 'grad_norm': 1.8423618078231812, 'learning_rate': 9.04752614309652e-05, 'epoch': 1.68}
{'loss': 1.0433, 'grad_norm': 1.9065622091293335, 'learning_rate': 9.01681110186971e-05, 'epoch': 1.7}
{'loss': 1.0798, 'grad_norm': 2.0103020668029785, 'learning_rate': 8.985662536114613e-05, 'epoch': 1.71}
{'loss': 1.1012, 'grad_norm': 1.5299313068389893, 'learning_rate': 8.954083807619208e-05, 'epoch': 1.73}
{'loss': 1.1219, 'grad_norm': 1.6331924200057983, 'learning_rate': 8.922078324597879e-05, 'epoch': 1.75}

0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-490
[INFO|configuration_utils.py:733] 2024-08-13 11:15:59,806 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 11:15:59,806 >> Model config Qwen2Config { ... same config dump as at checkpoint-420 above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 11:16:00,034 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-490/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 11:16:00,034 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-490/special_tokens_map.json

[tqdm progress bar output: steps 491-550 of 1680, elapsed 1:39:45 -> 1:51:26, ~17.9 s/it for the first step after the checkpoint save, then mostly ~12 s/it]
[tqdm progress bar output: steps 551-560 of 1680, elapsed 1:51:38 -> 1:53:26, mostly ~12 s/it]

[INFO|trainer.py:3819] 2024-08-13 11:29:54,681 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 11:29:54,681 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 11:29:54,681 >> Batch size = 1
{'eval_loss': 1.3044873476028442, 'eval_runtime': 17.7401, 'eval_samples_per_second': 2.593, 'eval_steps_per_second': 2.593, 'epoch': 1.75}
{'loss': 1.16, 'grad_norm': 1.6050705909729004, 'learning_rate': 8.889649541323574e-05, 'epoch': 1.77}
{'loss': 1.091, 'grad_norm': 1.7604998350143433, 'learning_rate': 8.856800957755e-05, 'epoch': 1.78}
{'loss': 1.072, 'grad_norm': 1.6485258340835571, 'learning_rate': 8.823536119158864e-05, 'epoch': 1.8}
{'loss': 1.0635, 'grad_norm': 1.8173716068267822, 'learning_rate': 8.789858615727265e-05, 'epoch': 1.82}
{'loss': 1.0258, 'grad_norm': 1.468127965927124, 'learning_rate': 8.755772082190194e-05, 'epoch': 1.84}
{'loss': 1.2011, 'grad_norm': 1.4476536512374878, 'learning_rate': 8.721280197423258e-05, 'epoch': 1.86}
{'loss': 1.0539, 'grad_norm': 2.054915189743042, 'learning_rate': 8.68638668405062e-05, 'epoch': 1.87}
{'loss': 1.0948, 'grad_norm': 1.8471094369888306, 'learning_rate': 8.651095308043232e-05, 'epoch': 1.89}
{'loss': 1.1245, 'grad_norm': 1.7790355682373047, 'learning_rate': 8.61540987831238e-05, 'epoch': 1.91}
{'loss': 1.2039, 'grad_norm': 1.6644902229309082, 'learning_rate': 8.579334246298593e-05, 'epoch': 1.93}
{'loss': 1.1077, 'grad_norm': 1.9952303171157837, 'learning_rate': 8.542872305555978e-05, 'epoch': 1.95}
{'loss': 1.0603, 'grad_norm': 2.225977659225464, 'learning_rate': 8.50602799133199e-05, 'epoch': 1.96}
{'loss': 1.1376, 'grad_norm': 1.777342438697815, 'learning_rate': 8.468805280142709e-05, 'epoch': 1.98}
{'loss': 1.0966, 'grad_norm': 2.2195017337799072, 'learning_rate': 8.43120818934367e-05, 'epoch': 2.0}

0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-560
[INFO|configuration_utils.py:733] 2024-08-13 11:30:13,119 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 11:30:13,120 >> Model config Qwen2Config { ... same config dump as at checkpoint-420 above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 11:30:13,347 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-560/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 11:30:13,348 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-560/special_tokens_map.json

[tqdm progress bar output: steps 561-610 of 1680, elapsed 1:53:58 -> 2:03:38, ~17.9 s/it for the first step after the checkpoint save, then mostly ~12 s/it]
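The `{'loss': ...}` and `{'eval_loss': ...}` entries above are printed by the Trainer as plain Python dict literals, so they can be recovered from a saved copy of this output for plotting or comparison. A minimal sketch, assuming the job output has been captured to a hypothetical file named train.log:

    import ast
    import re

    train_points, eval_points = [], []
    with open("train.log") as f:  # hypothetical path for a saved copy of this job output
        text = f.read()
    for match in re.finditer(r"\{'(?:eval_)?loss':.*?\}", text):
        entry = ast.literal_eval(match.group(0))  # the log prints plain dict literals
        if "eval_loss" in entry:
            eval_points.append((entry["epoch"], entry["eval_loss"]))
        else:
            train_points.append((entry["epoch"], entry["loss"]))

    print(eval_points[:3])  # e.g. [(1.25, 1.319...), (1.5, 1.310...), (1.75, 1.304...)]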
[tqdm progress bar output: steps 611-630 of 1680, elapsed 2:03:50 -> 2:07:36, mostly ~12 s/it]

[INFO|trainer.py:3819] 2024-08-13 11:44:03,984 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 11:44:03,984 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 11:44:03,984 >> Batch size = 1
{'eval_loss': 1.3094360828399658, 'eval_runtime': 17.7539, 'eval_samples_per_second': 2.591, 'eval_steps_per_second': 2.591, 'epoch': 2.0}
{'loss': 0.6867, 'grad_norm': 2.012312173843384, 'learning_rate': 8.393240776696274e-05, 'epoch': 2.02}
{'loss': 0.6025, 'grad_norm': 3.092951774597168, 'learning_rate': 8.354907139929851e-05, 'epoch': 2.03}
{'loss': 0.6497, 'grad_norm': 4.8303399085998535, 'learning_rate': 8.316211416299397e-05, 'epoch': 2.05}
{'loss': 0.5803, 'grad_norm': 3.1457698345184326, 'learning_rate': 8.27715778213905e-05, 'epoch': 2.07}
{'loss': 0.494, 'grad_norm': 2.5240321159362793, 'learning_rate': 8.237750452411353e-05, 'epoch': 2.09}
{'loss': 0.6428, 'grad_norm': 2.630946636199951, 'learning_rate': 8.197993680252334e-05, 'epoch': 2.11}
{'loss': 0.6612, 'grad_norm': 2.9942588806152344, 'learning_rate': 8.157891756512488e-05, 'epoch': 2.12}
{'loss': 0.5783, 'grad_norm': 2.8771650791168213, 'learning_rate': 8.117449009293668e-05, 'epoch': 2.14}
{'loss': 0.5799, 'grad_norm': 3.1111013889312744, 'learning_rate': 8.076669803481965e-05, 'epoch': 2.16}
{'loss': 0.5344, 'grad_norm': 3.715027093887329, 'learning_rate': 8.035558540276618e-05, 'epoch': 2.18}
{'loss': 0.5605, 'grad_norm': 2.936890125274658, 'learning_rate': 7.994119656715002e-05, 'epoch': 2.2}
{'loss': 0.5923, 'grad_norm': 2.79441499710083, 'learning_rate': 7.952357625193749e-05, 'epoch': 2.21}
{'loss': 0.6067, 'grad_norm': 3.444474697113037, 'learning_rate': 7.91027695298606e-05, 'epoch': 2.23}
{'loss': 0.6134, 'grad_norm': 3.034071445465088, 'learning_rate': 7.86788218175523e-05, 'epoch': 2.25}

0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-630
[INFO|configuration_utils.py:733] 2024-08-13 11:44:22,294 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 11:44:22,295 >> Model config Qwen2Config { ... same config dump as at checkpoint-420 above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 11:44:22,520 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-630/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 11:44:22,520 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-630/special_tokens_map.json

[tqdm progress bar output: steps 631-670 of 1680, elapsed 2:08:06 -> 2:15:52, ~17.4 s/it for the first step after the checkpoint save, then mostly ~12 s/it]
[tqdm progress bar output: steps 671-700 of 1680, elapsed 2:16:03 -> 2:21:46, mostly ~12 s/it]

[INFO|trainer.py:3819] 2024-08-13 11:58:14,309 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 11:58:14,309 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 11:58:14,309 >> Batch size = 1
{'eval_loss': 1.4945974349975586, 'eval_runtime': 17.7423, 'eval_samples_per_second': 2.593, 'eval_steps_per_second': 2.593, 'epoch': 2.25}
{'loss': 0.5798, 'grad_norm': 3.0743188858032227, 'learning_rate': 7.8251778870645e-05, 'epoch': 2.27}
{'loss': 0.5705, 'grad_norm': 3.250493049621582, 'learning_rate': 7.782168677883206e-05, 'epoch': 2.28}
{'loss': 0.6119, 'grad_norm': 2.4863390922546387, 'learning_rate': 7.738859196089358e-05, 'epoch': 2.3}
{'loss': 0.6352, 'grad_norm': 3.1027884483337402, 'learning_rate': 7.695254115968648e-05, 'epoch': 2.32}
{'loss': 0.6341, 'grad_norm': 2.840583562850952, 'learning_rate': 7.651358143709972e-05, 'epoch': 2.34}
{'loss': 0.6695, 'grad_norm': 3.057770252227783, 'learning_rate': 7.60717601689749e-05, 'epoch': 2.36}
{'loss': 0.5715, 'grad_norm': 3.563372850418091, 'learning_rate': 7.562712503999327e-05, 'epoch': 2.37}
{'loss': 0.7753, 'grad_norm': 3.2286486625671387, 'learning_rate': 7.517972403852905e-05, 'epoch': 2.39}
{'loss': 0.5529, 'grad_norm': 2.9088051319122314, 'learning_rate': 7.472960545147038e-05, 'epoch': 2.41}
{'loss': 0.5715, 'grad_norm': 2.9432833194732666, 'learning_rate': 7.427681785900761e-05, 'epoch': 2.43}
{'loss': 0.6085, 'grad_norm': 2.483222723007202, 'learning_rate': 7.382141012939034e-05, 'epoch': 2.45}
{'loss': 0.627, 'grad_norm': 2.9013617038726807, 'learning_rate': 7.33634314136531e-05, 'epoch': 2.46}
{'loss': 0.6403, 'grad_norm': 2.746309995651245, 'learning_rate': 7.290293114031061e-05, 'epoch': 2.48}
{'loss': 0.6342, 'grad_norm': 2.8350794315338135, 'learning_rate': 7.243995901002312e-05, 'epoch': 2.5}

0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-700
[INFO|configuration_utils.py:733] 2024-08-13 11:58:32,670 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 11:58:32,671 >> Model config Qwen2Config { ... same config dump as at checkpoint-420 above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 11:58:32,895 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-700/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 11:58:32,896 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-700/special_tokens_map.json

[tqdm progress bar output: steps 701-730 of 1680, elapsed 2:22:17 -> 2:28:04, ~17.6 s/it for the first step after the checkpoint save, then mostly ~12 s/it]
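Each checkpoint save above re-reads the cached Qwen2-72B-Instruct config.json and prints the same Qwen2Config dump. The same fields can be inspected directly from Python without touching the 72B weights; a small sketch (it reads the local HF cache, or downloads only the config file):

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("Qwen/Qwen2-72B-Instruct")

    print(config.model_type)           # "qwen2"
    print(config.hidden_size)          # 8192
    print(config.num_hidden_layers)    # 80
    print(config.num_key_value_heads)  # 8 (grouped-query attention)
    print(config.vocab_size)           # 152064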
[tqdm progress bar output: steps 731-770 of 1680, elapsed 2:28:16 -> 2:35:59, mostly ~12 s/it]

[INFO|trainer.py:3819] 2024-08-13 12:12:27,176 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 12:12:27,176 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 12:12:27,176 >> Batch size = 1
{'eval_loss': 1.4858874082565308, 'eval_runtime': 17.7385, 'eval_samples_per_second': 2.593, 'eval_steps_per_second': 2.593, 'epoch': 2.5}
{'loss': 0.5921, 'grad_norm': 3.006899833679199, 'learning_rate': 7.197456499023225e-05, 'epoch': 2.52}
{'loss': 0.5873, 'grad_norm': 2.9739573001861572, 'learning_rate': 7.150679930976825e-05, 'epoch': 2.53}
{'loss': 0.6661, 'grad_norm': 3.7028846740722656, 'learning_rate': 7.103671245342887e-05, 'epoch': 2.55}
{'loss': 0.5388, 'grad_norm': 3.090599775314331, 'learning_rate': 7.056435515653059e-05, 'epoch': 2.57}
{'loss': 0.6641, 'grad_norm': 2.799252986907959, 'learning_rate': 7.008977839943299e-05, 'epoch': 2.59}
{'loss': 0.6221, 'grad_norm': 2.8093032836914062, 'learning_rate': 6.961303340203653e-05, 'epoch': 2.61}
{'loss': 0.599, 'grad_norm': 3.6351985931396484, 'learning_rate': 6.91341716182545e-05, 'epoch': 2.62}
{'loss': 0.6047, 'grad_norm': 2.6190829277038574, 'learning_rate': 6.86532447304597e-05, 'epoch': 2.64}
{'loss': 0.614, 'grad_norm': 3.227262020111084, 'learning_rate': 6.817030464390656e-05, 'epoch': 2.66}
{'loss': 0.6367, 'grad_norm': 2.5810439586639404, 'learning_rate': 6.768540348112907e-05, 'epoch': 2.68}
{'loss': 0.5681, 'grad_norm': 3.030888557434082, 'learning_rate': 6.719859357631535e-05, 'epoch': 2.7}
{'loss': 0.5723, 'grad_norm': 3.1176657676696777, 'learning_rate': 6.670992746965938e-05, 'epoch': 2.71}
{'loss': 0.6385, 'grad_norm': 3.0151100158691406, 'learning_rate': 6.621945790169036e-05, 'epoch': 2.73}
{'loss': 0.6665, 'grad_norm': 3.4799766540527344, 'learning_rate': 6.572723780758069e-05, 'epoch': 2.75}

0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-770
[INFO|configuration_utils.py:733] 2024-08-13 12:12:45,548 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 12:12:45,548 >> Model config Qwen2Config { ... same config dump as at checkpoint-420 above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 12:12:45,774 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-770/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 12:12:45,775 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-770/special_tokens_map.json

[tqdm progress bar output: steps 771-827 of 1680, elapsed 2:36:30 -> 2:47:35, ~17.7 s/it for the first step after the checkpoint save, then mostly ~12 s/it]
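From the figures visible in the progress lines, the end-to-end wall-clock time of the run can be estimated: roughly 12 s per optimizer step, 1680 steps in total, and an evaluation (~18 s) plus a checkpoint save every 70 steps. A rough back-of-the-envelope sketch; the per-step and overhead numbers are eyeballed from this log, not measured:

    # Rough ETA estimate from numbers visible in this log (eyeballed, not exact).
    steps_total   = 1680
    sec_per_step  = 12.0       # typical pace shown by the tqdm lines
    eval_every    = 70         # an evaluation + checkpoint save happens every 70 steps
    eval_overhead = 18 + 15    # ~18 s eval_runtime plus a rough allowance for the save

    total_sec = steps_total * sec_per_step + (steps_total // eval_every) * eval_overhead
    print(f"estimated wall clock: {total_sec / 3600:.1f} h")  # ~5.8 h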
[tqdm progress bar output: steps 828-840 of 1680, elapsed 2:47:46 -> 2:50:11, mostly ~12 s/it]

[INFO|trainer.py:3819] 2024-08-13 12:26:39,310 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 12:26:39,310 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 12:26:39,310 >> Batch size = 1
{'eval_loss': 1.5236101150512695, 'eval_runtime': 17.7462, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 2.75}
{'loss': 0.6083, 'grad_norm': 3.1448163986206055, 'learning_rate': 6.523332031143272e-05, 'epoch': 2.77}
{'loss': 0.6493, 'grad_norm': 2.874833106994629, 'learning_rate': 6.473775872054521e-05, 'epoch': 2.78}
{'loss': 0.5722, 'grad_norm': 3.2550127506256104, 'learning_rate': 6.424060651966007e-05, 'epoch': 2.8}
{'loss': 0.611, 'grad_norm': 3.066908121109009, 'learning_rate': 6.374191736518974e-05, 'epoch': 2.82}
{'loss': 0.6202, 'grad_norm': 3.05871319770813, 'learning_rate': 6.324174507942637e-05, 'epoch': 2.84}
{'loss': 0.5593, 'grad_norm': 3.2599833011627197, 'learning_rate': 6.274014364473274e-05, 'epoch': 2.86}
{'loss': 0.7415, 'grad_norm': 2.897418260574341, 'learning_rate': 6.22371671977162e-05, 'epoch': 2.87}
{'loss': 0.6544, 'grad_norm': 3.032317876815796, 'learning_rate': 6.173287002338577e-05, 'epoch': 2.89}
{'loss': 0.6421, 'grad_norm': 2.7111008167266846, 'learning_rate': 6.122730654929334e-05, 'epoch': 2.91}
{'loss': 0.6332, 'grad_norm': 2.7735886573791504, 'learning_rate': 6.072053133965938e-05, 'epoch': 2.93}
{'loss': 0.6508, 'grad_norm': 3.4417500495910645, 'learning_rate': 6.021259908948402e-05, 'epoch': 2.95}
{'loss': 0.621, 'grad_norm': 3.432999849319458, 'learning_rate': 5.970356461864391e-05, 'epoch': 2.96}
{'loss': 0.6347, 'grad_norm': 3.470132827758789, 'learning_rate': 5.919348286597569e-05, 'epoch': 2.98}
{'loss': 0.6101, 'grad_norm': 3.153116226196289, 'learning_rate': 5.868240888334653e-05, 'epoch': 3.0}

0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-840
[INFO|configuration_utils.py:733] 2024-08-13 12:26:57,604 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 12:26:57,605 >> Model config Qwen2Config { ... same config dump as at checkpoint-420 above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 12:26:57,830 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-840/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 12:26:57,831 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-840/special_tokens_map.json

[tqdm progress bar output: steps 841-886 of 1680, elapsed 2:50:42 -> 2:59:36, ~17.9 s/it for the first step after the checkpoint save, then mostly ~12 s/it]
[tqdm progress bar output: steps 887-910 of 1680, elapsed 2:59:49 -> 3:04:23, mostly ~12 s/it]

[INFO|trainer.py:3819] 2024-08-13 12:40:51,546 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 12:40:51,546 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 12:40:51,546 >> Batch size = 1
{'eval_loss': 1.5220016241073608, 'eval_runtime': 17.7399, 'eval_samples_per_second': 2.593, 'eval_steps_per_second': 2.593, 'epoch': 3.0}
{'loss': 0.4183, 'grad_norm': 2.5395278930664062, 'learning_rate': 5.8170397829712485e-05, 'epoch': 3.02}
{'loss': 0.1667, 'grad_norm': 2.833970308303833, 'learning_rate': 5.765750496516547e-05, 'epoch': 3.03}
{'loss': 0.255, 'grad_norm': 3.447057008743286, 'learning_rate': 5.714378564496901e-05, 'epoch': 3.05}
{'loss': 0.2424, 'grad_norm': 3.9993224143981934, 'learning_rate': 5.6629295313583974e-05, 'epoch': 3.07}
{'loss': 0.2097, 'grad_norm': 3.626281499862671, 'learning_rate': 5.611408949868457e-05, 'epoch': 3.09}
{'loss': 0.2271, 'grad_norm': 2.693284034729004, 'learning_rate': 5.559822380516539e-05, 'epoch': 3.11}
{'loss': 0.1982, 'grad_norm': 2.439389705657959, 'learning_rate': 5.5081753909140096e-05, 'epoch': 3.12}
{'loss': 0.2192, 'grad_norm': 2.6163575649261475, 'learning_rate': 5.456473555193242e-05, 'epoch': 3.14}
{'loss': 0.2097, 'grad_norm': 2.405829668045044, 'learning_rate': 5.404722453406017e-05, 'epoch': 3.16}
{'loss': 0.2213, 'grad_norm': 2.819413423538208, 'learning_rate': 5.3529276709212816e-05, 'epoch': 3.18}
{'loss': 0.2559, 'grad_norm': 3.6370203495025635, 'learning_rate': 5.30109479782233e-05, 'epoch': 3.2}
{'loss': 0.1955, 'grad_norm': 3.4090726375579834, 'learning_rate': 5.249229428303486e-05, 'epoch': 3.21}
{'loss': 0.2642, 'grad_norm': 2.8171908855438232, 'learning_rate': 5.197337160066331e-05, 'epoch': 3.23}
{'loss': 0.2467, 'grad_norm': 3.926447629928589, 'learning_rate': 5.145423593715557e-05, 'epoch': 3.25}

0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-910
[INFO|configuration_utils.py:733] 2024-08-13 12:41:09,835 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 12:41:09,836 >> Model config Qwen2Config { ... same config dump as at checkpoint-420 above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 12:41:10,061 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-910/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 12:41:10,062 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-910/special_tokens_map.json

[tqdm progress bar output: steps 911-943 of 1680, elapsed 3:04:54 -> 3:11:21, ~17.8 s/it for the first step after the checkpoint save, then mostly ~12 s/it]
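The trend in the evaluation entries is now clear: training loss keeps falling (from ~1.1 in epochs 1-2 to ~0.6 in epochs 2-3 and ~0.2 after epoch 3) while eval_loss bottoms out around epoch 1.75-2.0 and then climbs, so the most recent checkpoint is not the best one by validation loss. A small sketch that picks the best checkpoint directory from the eval_loss values printed in this part of the log (rounded to four decimals):

    # eval_loss per saved checkpoint, copied (rounded) from the evaluation entries above.
    eval_loss = {
        "saves/Qwen2-72B-Instruct/checkpoint-420": 1.3195,  # epoch 1.25
        "saves/Qwen2-72B-Instruct/checkpoint-490": 1.3106,  # epoch 1.50
        "saves/Qwen2-72B-Instruct/checkpoint-560": 1.3045,  # epoch 1.75
        "saves/Qwen2-72B-Instruct/checkpoint-630": 1.3094,  # epoch 2.00
        "saves/Qwen2-72B-Instruct/checkpoint-700": 1.4946,  # epoch 2.25
        "saves/Qwen2-72B-Instruct/checkpoint-770": 1.4859,  # epoch 2.50
        "saves/Qwen2-72B-Instruct/checkpoint-840": 1.5236,  # epoch 2.75
        "saves/Qwen2-72B-Instruct/checkpoint-910": 1.5220,  # epoch 3.00
    }

    best = min(eval_loss, key=eval_loss.get)
    print(best, eval_loss[best])  # checkpoint-560 has the lowest eval loss in this excerpt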
[3:11:33<2:27:40, 12.04s/it] 56%|█████▋ | 945/1680 [3:11:45<2:28:53, 12.15s/it] 56%|█████▋ | 945/1680 [3:11:45<2:28:53, 12.15s/it] 56%|█████▋ | 946/1680 [3:11:57<2:26:30, 11.98s/it] 56%|█████▋ | 947/1680 [3:12:08<2:25:05, 11.88s/it] 56%|█████▋ | 948/1680 [3:12:20<2:25:15, 11.91s/it] 56%|█████▋ | 949/1680 [3:12:31<2:21:55, 11.65s/it] 57%|█████▋ | 950/1680 [3:12:43<2:22:06, 11.68s/it] 57%|█████▋ | 950/1680 [3:12:43<2:22:06, 11.68s/it] 57%|█████▋ | 951/1680 [3:12:56<2:25:22, 11.96s/it] 57%|█████▋ | 952/1680 [3:13:09<2:28:37, 12.25s/it] 57%|█████▋ | 953/1680 [3:13:20<2:25:35, 12.02s/it] 57%|█████▋ | 954/1680 [3:13:32<2:24:13, 11.92s/it] 57%|█████▋ | 955/1680 [3:13:43<2:21:36, 11.72s/it] 57%|█████▋ | 955/1680 [3:13:43<2:21:36, 11.72s/it] 57%|█████▋ | 956/1680 [3:13:55<2:22:25, 11.80s/it] 57%|█████▋ | 957/1680 [3:14:07<2:21:52, 11.77s/it] 57%|█████▋ | 958/1680 [3:14:19<2:21:26, 11.75s/it] 57%|█████▋ | 959/1680 [3:14:30<2:20:03, 11.66s/it] 57%|█████▋ | 960/1680 [3:14:43<2:23:39, 11.97s/it] 57%|█████▋ | 960/1680 [3:14:43<2:23:39, 11.97s/it] 57%|█████▋ | 961/1680 [3:14:55<2:24:52, 12.09s/it] 57%|█████▋ | 962/1680 [3:15:07<2:23:53, 12.02s/it] 57%|█████▋ | 963/1680 [3:15:19<2:23:38, 12.02s/it] 57%|█████▋ | 964/1680 [3:15:31<2:25:03, 12.16s/it] 57%|█████▋ | 965/1680 [3:15:44<2:24:40, 12.14s/it] 57%|█████▋ | 965/1680 [3:15:44<2:24:40, 12.14s/it] 57%|█████▊ | 966/1680 [3:15:55<2:22:28, 11.97s/it] 58%|█████▊ | 967/1680 [3:16:08<2:24:44, 12.18s/it] 58%|█████▊ | 968/1680 [3:16:20<2:23:52, 12.12s/it] 58%|█████▊ | 969/1680 [3:16:32<2:23:59, 12.15s/it] 58%|█████▊ | 970/1680 [3:16:45<2:26:35, 12.39s/it] 58%|█████▊ | 970/1680 [3:16:45<2:26:35, 12.39s/it] 58%|█████▊ | 971/1680 [3:16:57<2:25:18, 12.30s/it] 58%|█████▊ | 972/1680 [3:17:10<2:26:04, 12.38s/it] 58%|█████▊ | 973/1680 [3:17:22<2:24:42, 12.28s/it] 58%|█████▊ | 974/1680 [3:17:34<2:23:30, 12.20s/it] 58%|█████▊ | 975/1680 [3:17:45<2:20:14, 11.94s/it] 58%|█████▊ | 975/1680 [3:17:45<2:20:14, 11.94s/it] 58%|█████▊ | 976/1680 [3:17:58<2:22:23, 12.14s/it] 58%|█████▊ | 977/1680 [3:18:08<2:17:17, 11.72s/it] 58%|█████▊ | 978/1680 [3:18:19<2:14:49, 11.52s/it] 58%|█████▊ | 979/1680 [3:18:31<2:14:28, 11.51s/it] 58%|█████▊ | 980/1680 [3:18:43<2:15:44, 11.63s/it] 58%|█████▊ | 980/1680 [3:18:43<2:15:44, 11.63s/it][INFO|trainer.py:3819] 2024-08-13 12:55:11,257 >> ***** Running Evaluation ***** [INFO|trainer.py:3821] 2024-08-13 12:55:11,257 >> Num examples = 46 [INFO|trainer.py:3824] 2024-08-13 12:55:11,257 >> Batch size = 1 {'eval_loss': 1.8390079736709595, 'eval_runtime': 17.7348, 'eval_samples_per_second': 2.594, 'eval_steps_per_second': 2.594, 'epoch': 3.25} {'loss': 0.2239, 'grad_norm': 2.7143030166625977, 'learning_rate': 5.0934943321545115e-05, 'epoch': 3.27} {'loss': 0.1545, 'grad_norm': 2.717496871948242, 'learning_rate': 5.041554979980486e-05, 'epoch': 3.28} {'loss': 0.2819, 'grad_norm': 3.516397714614868, 'learning_rate': 4.9896111428798254e-05, 'epoch': 3.3} {'loss': 0.3043, 'grad_norm': 3.3290677070617676, 'learning_rate': 4.9376684270229254e-05, 'epoch': 3.32} {'loss': 0.2494, 'grad_norm': 2.914736032485962, 'learning_rate': 4.8857324384591653e-05, 'epoch': 3.34} {'loss': 0.2271, 'grad_norm': 3.37791109085083, 'learning_rate': 4.8338087825118675e-05, 'epoch': 3.36} {'loss': 0.242, 'grad_norm': 3.295100688934326, 'learning_rate': 4.781903063173321e-05, 'epoch': 3.37} {'loss': 0.2244, 'grad_norm': 2.5792458057403564, 'learning_rate': 4.730020882499964e-05, 'epoch': 3.39} {'loss': 0.2552, 'grad_norm': 3.0014591217041016, 'learning_rate': 4.678167840007767e-05, 
'epoch': 3.41} {'loss': 0.2542, 'grad_norm': 3.207282066345215, 'learning_rate': 4.626349532067879e-05, 'epoch': 3.43} {'loss': 0.3249, 'grad_norm': 3.85109543800354, 'learning_rate': 4.574571551302647e-05, 'epoch': 3.44} {'loss': 0.2729, 'grad_norm': 3.3335843086242676, 'learning_rate': 4.522839485981994e-05, 'epoch': 3.46} {'loss': 0.2595, 'grad_norm': 2.885708808898926, 'learning_rate': 4.471158919420312e-05, 'epoch': 3.48} {'loss': 0.2284, 'grad_norm': 3.215789556503296, 'learning_rate': 4.4195354293738484e-05, 'epoch': 3.5} 0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-980 [INFO|configuration_utils.py:733] 2024-08-13 12:55:29,698 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json [INFO|configuration_utils.py:800] 2024-08-13 12:55:29,699 >> Model config Qwen2Config { "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 8192, "initializer_range": 0.02, "intermediate_size": 29568, "max_position_embeddings": 32768, "max_window_layers": 80, "model_type": "qwen2", "num_attention_heads": 64, "num_hidden_layers": 80, "num_key_value_heads": 8, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.3", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 } [INFO|tokenization_utils_base.py:2702] 2024-08-13 12:55:29,927 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-980/tokenizer_config.json [INFO|tokenization_utils_base.py:2711] 2024-08-13 12:55:29,928 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-980/special_tokens_map.json /common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
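The FutureWarning repeated throughout this run comes from torch.utils.checkpoint and only concerns the deprecated spelling of the CPU autocast context manager; the fix is stated in the warning text itself. A minimal sketch of the two spellings (illustration only, not a patch applied to this run):

    import torch

    # Deprecated spelling -- still accepted, but it triggers the FutureWarning seen in this log.
    with torch.cpu.amp.autocast(dtype=torch.bfloat16):
        pass

    # Replacement suggested by the warning: pass the device type to torch.amp.autocast.
    with torch.amp.autocast('cpu', dtype=torch.bfloat16):
        pass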
58%|█████▊ | 981/1680 [3:19:13<3:22:08, 17.35s/it] ... 62%|██████▎ | 1050/1680 [3:32:55<2:03:56, 11.80s/it]
[INFO|trainer.py:3819] 2024-08-13 13:09:23,191 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 13:09:23,192 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 13:09:23,192 >> Batch size = 1
{'eval_loss': 1.82525634765625, 'eval_runtime': 17.7537, 'eval_samples_per_second': 2.591, 'eval_steps_per_second': 2.591, 'epoch': 3.5}
{'loss': 0.1947, 'grad_norm': 3.4772818088531494, 'learning_rate': 4.367974587438733e-05, 'epoch': 3.52}
{'loss': 0.2352, 'grad_norm': 2.6401774883270264, 'learning_rate': 4.316481958449634e-05, 'epoch': 3.53}
{'loss': 0.2047, 'grad_norm': 3.997591733932495, 'learning_rate': 4.2650630998791615e-05, 'epoch': 3.55}
{'loss': 0.2369, 'grad_norm': 2.5615384578704834, 'learning_rate': 4.213723561238074e-05, 'epoch': 3.57}
{'loss': 0.2416, 'grad_norm': 2.5114736557006836, 'learning_rate': 4.162468883476319e-05, 'epoch': 3.59}
{'loss': 0.2353, 'grad_norm': 4.23993444442749, 'learning_rate': 4.111304598385018e-05, 'epoch': 3.61}
{'loss': 0.2155, 'grad_norm': 3.239319324493408, 'learning_rate': 4.060236227999441e-05, 'epoch': 3.62}
{'loss': 0.2241, 'grad_norm': 2.030393600463867, 'learning_rate': 4.0092692840030134e-05, 'epoch': 3.64}
{'loss': 0.2408, 'grad_norm': 3.636963367462158, 'learning_rate': 3.9584092671324606e-05, 'epoch': 3.66}
{'loss': 0.2423, 'grad_norm': 4.295063495635986, 'learning_rate': 3.907661666584131e-05, 'epoch': 3.68}
{'loss': 0.2581, 'grad_norm': 3.268596887588501, 'learning_rate': 3.857031959421553e-05, 'epoch': 3.69}
{'loss': 0.206, 'grad_norm': 3.0428457260131836, 'learning_rate': 3.806525609984312e-05, 'epoch': 3.71}
{'loss': 0.1956, 'grad_norm': 3.523777484893799, 'learning_rate': 3.7561480692983006e-05, 'epoch': 3.73}
{'loss': 0.2839, 'grad_norm': 2.972714900970459, 'learning_rate': 3.705904774487396e-05, 'epoch': 3.75}
0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1050
[INFO|configuration_utils.py:733] 2024-08-13 13:09:41,530 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 13:09:41,530 >> Model config Qwen2Config { ... identical to the Qwen2Config dump printed above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 13:09:41,815 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1050/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 13:09:41,815 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1050/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
63%|██████▎ | 1051/1680 [3:33:26<3:03:59, 17.55s/it] ... 67%|██████▋ | 1120/1680 [3:47:00<1:52:24, 12.04s/it]
[INFO|trainer.py:3819] 2024-08-13 13:23:28,061 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 13:23:28,061 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 13:23:28,061 >> Batch size = 1
{'eval_loss': 1.8687995672225952, 'eval_runtime': 17.732, 'eval_samples_per_second': 2.594, 'eval_steps_per_second': 2.594, 'epoch': 3.75}
{'loss': 0.2433, 'grad_norm': 3.9769251346588135, 'learning_rate': 3.655801148186655e-05, 'epoch': 3.77}
{'loss': 0.2085, 'grad_norm': 3.03606915473938, 'learning_rate': 3.6058425979570485e-05, 'epoch': 3.78}
{'loss': 0.2277, 'grad_norm': 3.5858893394470215, 'learning_rate': 3.556034515701852e-05, 'epoch': 3.8}
{'loss': 0.2497, 'grad_norm': 2.5949602127075195, 'learning_rate': 3.506382277084696e-05, 'epoch': 3.82}
{'loss': 0.2462, 'grad_norm': 2.8706088066101074, 'learning_rate': 3.4568912409493945e-05, 'epoch': 3.84}
{'loss': 0.2004, 'grad_norm': 3.238346576690674, 'learning_rate': 3.4075667487415785e-05, 'epoch': 3.86}
{'loss': 0.226, 'grad_norm': 3.36478590965271, 'learning_rate': 3.358414123932195e-05, 'epoch': 3.87}
{'loss': 0.2114, 'grad_norm': 3.0954155921936035, 'learning_rate': 3.3094386714429724e-05, 'epoch': 3.89}
{'loss': 0.2694, 'grad_norm': 3.016141891479492, 'learning_rate': 3.2606456770738636e-05, 'epoch': 3.91}
{'loss': 0.1828, 'grad_norm': 2.976658821105957, 'learning_rate': 3.212040406932569e-05, 'epoch': 3.93}
{'loss': 0.1451, 'grad_norm': 2.8186426162719727, 'learning_rate': 3.163628106866172e-05, 'epoch': 3.94}
{'loss': 0.2349, 'grad_norm': 2.959024429321289, 'learning_rate': 3.115414001894974e-05, 'epoch': 3.96}
{'loss': 0.2235, 'grad_norm': 2.9852728843688965, 'learning_rate': 3.067403295648566e-05, 'epoch': 3.98}
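The evaluation summaries are internally consistent: with 46 examples and a batch size of 1 there are 46 evaluation steps, so eval_steps_per_second equals eval_samples_per_second, and both are simply the example count divided by eval_runtime. A quick check against the epoch-3.75 evaluation above (values copied from the log):

    num_examples = 46          # from "Num examples = 46"
    eval_runtime = 17.732      # seconds, from the eval summary at epoch 3.75
    print(round(num_examples / eval_runtime, 3))   # -> 2.594, matching eval_samples_per_second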
{'loss': 0.2111, 'grad_norm': 2.79172945022583, 'learning_rate': 3.019601169804216e-05, 'epoch': 4.0}
0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1120
[INFO|configuration_utils.py:733] 2024-08-13 13:23:46,377 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 13:23:46,378 >> Model config Qwen2Config { ... identical to the Qwen2Config dump printed above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 13:23:46,603 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1120/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 13:23:46,603 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1120/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
67%|██████▋ | 1121/1680 [3:47:31<2:47:12, 17.95s/it] ... 70%|███████ | 1190/1680 [4:01:10<1:37:52, 11.99s/it]
[INFO|trainer.py:3819] 2024-08-13 13:37:38,497 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 13:37:38,497 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 13:37:38,498 >> Batch size = 1
{'eval_loss': 1.891045093536377, 'eval_runtime': 17.7382, 'eval_samples_per_second': 2.593, 'eval_steps_per_second': 2.593, 'epoch': 4.0}
{'loss': 0.1074, 'grad_norm': 1.1968103647232056, 'learning_rate': 2.9720127835276256e-05, 'epoch': 4.02}
{'loss': 0.0628, 'grad_norm': 1.4865480661392212, 'learning_rate': 2.9246432729161055e-05, 'epoch': 4.03}
{'loss': 0.0615, 'grad_norm': 2.913541078567505, 'learning_rate': 2.8774977504442647e-05, 'epoch': 4.05}
{'loss': 0.0658, 'grad_norm': 2.1043801307678223, 'learning_rate': 2.8305813044122097e-05, 'epoch': 4.07}
{'loss': 0.0458, 'grad_norm': 1.942076325416565, 'learning_rate': 2.7838989983964065e-05, 'epoch': 4.09}
{'loss': 0.0877, 'grad_norm': 2.3953213691711426, 'learning_rate': 2.737455870703155e-05, 'epoch': 4.11}
{'loss': 0.0567, 'grad_norm': 1.9993913173675537, 'learning_rate': 2.6912569338248315e-05, 'epoch': 4.12}
{'loss': 0.0817, 'grad_norm': 2.4731192588806152, 'learning_rate': 2.645307173898901e-05, 'epoch': 4.14}
{'loss': 0.0517, 'grad_norm': 2.3913474082946777, 'learning_rate': 2.5996115501697694e-05, 'epoch': 4.16}
{'loss': 0.0649, 'grad_norm': 4.154366493225098, 'learning_rate': 2.5541749944535554e-05, 'epoch': 4.18}
{'loss': 0.0613, 'grad_norm': 1.4376811981201172, 'learning_rate': 2.5090024106057962e-05, 'epoch': 4.19}
{'loss': 0.0763, 'grad_norm': 2.038010835647583, 'learning_rate': 2.464098673992205e-05, 'epoch': 4.21}
{'loss': 0.0733, 'grad_norm': 1.862741470336914, 'learning_rate': 2.4194686309624663e-05, 'epoch': 4.23}
{'loss': 0.0753, 'grad_norm': 2.7354800701141357, 'learning_rate': 2.3751170983272e-05, 'epoch': 4.25}
0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1190
[INFO|configuration_utils.py:733] 2024-08-13 13:37:57,186 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 13:37:57,186 >> Model config Qwen2Config { ... identical to the Qwen2Config dump printed above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 13:37:57,603 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1190/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 13:37:57,604 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1190/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
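The step and epoch counters in the progress lines agree with each other: evaluation and checkpointing fire every 70 optimizer steps, and the epoch counter advances by 0.25 between consecutive evaluations, so one epoch is 280 steps and the scheduled 1680 steps correspond to 6 epochs. The arithmetic, with the numbers read off the log:

    steps_between_evals = 1190 - 1120     # checkpoints in this section are 70 steps apart
    epochs_between_evals = 4.25 - 4.00    # epoch counter advance over the same interval
    steps_per_epoch = steps_between_evals / epochs_between_evals
    print(steps_per_epoch, 1680 / steps_per_epoch)   # -> 280.0 6.0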
71%|███████ | 1191/1680 [4:01:42<2:26:47, 18.01s/it] ... 75%|███████▌ | 1260/1680 [4:15:17<1:22:37, 11.80s/it]
[INFO|trainer.py:3819] 2024-08-13 13:51:45,570 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 13:51:45,571 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 13:51:45,571 >> Batch size = 1
{'eval_loss': 2.2224178314208984, 'eval_runtime': 17.7489, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 4.25}
{'loss': 0.0839, 'grad_norm': 1.102008581161499, 'learning_rate': 2.3310488628380757e-05, 'epoch': 4.27}
{'loss': 0.0811, 'grad_norm': 4.02572774887085, 'learning_rate': 2.2872686806712035e-05, 'epoch': 4.28}
{'loss': 0.0783, 'grad_norm': 1.9711402654647827, 'learning_rate': 2.243781276913811e-05, 'epoch': 4.3}
{'loss': 0.0488, 'grad_norm': 2.0151891708374023, 'learning_rate': 2.200591345054267e-05, 'epoch': 4.32}
{'loss': 0.0704, 'grad_norm': 4.591026782989502, 'learning_rate': 2.157703546475539e-05, 'epoch': 4.34}
{'loss': 0.0653, 'grad_norm': 1.2874963283538818, 'learning_rate': 2.115122509952085e-05, 'epoch': 4.36}
{'loss': 0.0471, 'grad_norm': 2.7136454582214355, 'learning_rate': 2.0728528311502976e-05, 'epoch': 4.37}
{'loss': 0.0757, 'grad_norm': 2.6785166263580322, 'learning_rate': 2.0308990721324927e-05, 'epoch': 4.39}
{'loss': 0.0456, 'grad_norm': 1.6510692834854126, 'learning_rate': 1.989265760864542e-05, 'epoch': 4.41}
{'loss': 0.0555, 'grad_norm': 1.2233620882034302, 'learning_rate': 1.947957390727185e-05, 'epoch': 4.43}
{'loss': 0.0559, 'grad_norm': 2.3564908504486084, 'learning_rate': 1.906978420031059e-05, 'epoch': 4.44}
{'loss': 0.0395, 'grad_norm': 1.9344422817230225, 'learning_rate': 1.8663332715355396e-05, 'epoch': 4.46}
{'loss': 0.0681, 'grad_norm': 1.6214028596878052, 'learning_rate': 1.8260263319713844e-05, 'epoch': 4.48}
{'loss': 0.072, 'grad_norm': 2.0569422245025635, 'learning_rate': 1.7860619515673033e-05, 'epoch': 4.5}
0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1260
[INFO|configuration_utils.py:733] 2024-08-13 13:52:04,337 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 13:52:04,338 >> Model config Qwen2Config { ... identical to the Qwen2Config dump printed above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 13:52:04,618 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1260/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 13:52:04,618 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1260/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
75%|███████▌ | 1261/1680 [4:15:49<2:04:08, 17.78s/it] ... 79%|███████▉ | 1330/1680 [4:29:32<1:07:29, 11.57s/it]
[INFO|trainer.py:3819] 2024-08-13 14:06:00,207 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 14:06:00,207 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 14:06:00,207 >> Batch size = 1
{'eval_loss': 2.309265613555908, 'eval_runtime': 17.7487, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 4.5}
{'loss': 0.0614, 'grad_norm': 2.3488802909851074, 'learning_rate': 1.746444443580433e-05, 'epoch': 4.52}
{'loss': 0.0644, 'grad_norm': 2.056544303894043, 'learning_rate': 1.7071780838308288e-05, 'epoch': 4.53}
{'loss': 0.0678, 'grad_norm': 2.576493740081787, 'learning_rate': 1.6682671102399805e-05, 'epoch': 4.55}
{'loss': 0.0474, 'grad_norm': 1.5977071523666382, 'learning_rate': 1.629715722373423e-05, 'epoch': 4.57}
{'loss': 0.0813, 'grad_norm': 3.0858843326568604, 'learning_rate': 1.5915280809874932e-05, 'epoch': 4.59}
{'loss': 0.0483, 'grad_norm': 2.914644241333008, 'learning_rate': 1.553708307580265e-05, 'epoch': 4.61}
{'loss': 0.0644, 'grad_norm': 2.8291921615600586, 'learning_rate': 1.5162604839467265e-05, 'epoch': 4.62}
{'loss': 0.0581, 'grad_norm': 2.5296852588653564, 'learning_rate': 1.4791886517382413e-05, 'epoch': 4.64}
{'loss': 0.0569, 'grad_norm': 1.3932641744613647, 'learning_rate': 1.4424968120263504e-05, 'epoch': 4.66}
{'loss': 0.0645, 'grad_norm': 1.6407183408737183, 'learning_rate': 1.4061889248709343e-05, 'epoch': 4.68}
{'loss': 0.0588, 'grad_norm': 2.565559148788452, 'learning_rate': 1.370268908892825e-05, 'epoch': 4.69}
{'loss': 0.078, 'grad_norm': 2.400225877761841, 'learning_rate': 1.3347406408508695e-05, 'epoch': 4.71}
{'loss': 0.0948, 'grad_norm': 3.091597318649292, 'learning_rate': 1.2996079552235263e-05, 'epoch': 4.73}
{'loss': 0.0351, 'grad_norm': 2.0770254135131836, 'learning_rate': 1.264874643795021e-05, 'epoch': 4.75}
0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1330
[INFO|configuration_utils.py:733] 2024-08-13 14:06:19,525 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 14:06:19,525 >> Model config Qwen2Config { ... identical to the Qwen2Config dump printed above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 14:06:19,942 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1330/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 14:06:19,943 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1330/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
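The Qwen2Config dump that the trainer reprints at every checkpoint save describes the fixed geometry of the base model: 80 hidden layers, 64 attention heads with 8 key/value heads (grouped-query attention), hidden size 8192, and a 152064-token vocabulary. A small sketch for pulling the same config programmatically; the model id is inferred from the models--Qwen--Qwen2-72B-Instruct cache path in the log:

    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("Qwen/Qwen2-72B-Instruct")
    print(cfg.num_hidden_layers, cfg.num_attention_heads, cfg.num_key_value_heads, cfg.vocab_size)
    # -> 80 64 8 152064, matching the dump in this log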
79%|███████▉ | 1331/1680 [4:30:04<1:42:39, 17.65s/it] ... 83%|████████▎ | 1400/1680 [4:43:53<57:05, 12.23s/it]
[INFO|trainer.py:3819] 2024-08-13 14:20:21,539 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 14:20:21,539 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 14:20:21,539 >> Batch size = 1
{'eval_loss': 2.2220773696899414, 'eval_runtime': 17.7459, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 4.75}
{'loss': 0.031, 'grad_norm': 1.403196930885315, 'learning_rate': 1.230544455246101e-05, 'epoch': 4.77}
{'loss': 0.0584, 'grad_norm': 2.3339104652404785, 'learning_rate': 1.1966210947494583e-05, 'epoch': 4.78}
{'loss': 0.0633, 'grad_norm': 2.0965840816497803, 'learning_rate': 1.1631082235698316e-05, 'epoch': 4.8}
{'loss': 0.0485, 'grad_norm': 1.8118559122085571, 'learning_rate': 1.130009458668863e-05, 'epoch': 4.82}
{'loss': 0.0271, 'grad_norm': 1.4843353033065796, 'learning_rate': 1.097328372314721e-05, 'epoch': 4.84}
{'loss': 0.0559, 'grad_norm': 2.7621233463287354, 'learning_rate': 1.0650684916965559e-05, 'epoch': 4.85}
{'loss': 0.0582, 'grad_norm': 0.8147066831588745, 'learning_rate': 1.0332332985438248e-05, 'epoch': 4.87}
{'loss': 0.0965, 'grad_norm': 2.686469316482544, 'learning_rate': 1.0018262287505086e-05, 'epoch': 4.89}
{'loss': 0.0565, 'grad_norm': 1.0777071714401245, 'learning_rate': 9.708506720042932e-06, 'epoch': 4.91}
{'loss': 0.0542, 'grad_norm': 3.4182119369506836, 'learning_rate': 9.403099714207175e-06, 'epoch': 4.93}
{'loss': 0.0856, 'grad_norm': 1.8600770235061646, 'learning_rate': 9.102074231823727e-06, 'epoch': 4.94}
{'loss': 0.0524, 'grad_norm': 2.112198829650879, 'learning_rate': 8.805462761831418e-06, 'epoch': 4.96}
{'loss': 0.0641, 'grad_norm': 1.6986050605773926, 'learning_rate': 8.513297316775625e-06, 'epoch': 4.98}
{'loss': 0.0644, 'grad_norm': 1.5771281719207764, 'learning_rate': 8.225609429353187e-06, 'epoch': 5.0}
0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1400
[INFO|configuration_utils.py:733] 2024-08-13 14:20:39,857 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 14:20:39,858 >> Model config Qwen2Config { ... identical to the Qwen2Config dump printed above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 14:20:40,649 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1400/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 14:20:40,650 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1400/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
83%|████████▎ | 1401/1680 [4:44:26<1:25:15, 18.34s/it] ... 88%|████████▊ | 1470/1680 [4:58:06<41:57, 11.99s/it]
[INFO|trainer.py:3819] 2024-08-13 14:34:34,001 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 14:34:34,001 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 14:34:34,001 >> Batch size = 1
{'eval_loss': 2.2804083824157715, 'eval_runtime': 17.7438, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 5.0}
{'loss': 0.0227, 'grad_norm': 0.8998332023620605, 'learning_rate': 7.942430149009161e-06, 'epoch': 5.02}
{'loss': 0.0175, 'grad_norm': 0.6817569136619568, 'learning_rate': 7.663790038585793e-06, 'epoch': 5.03}
{'loss': 0.0161, 'grad_norm': 0.35046374797821045, 'learning_rate': 7.389719171023857e-06, 'epoch': 5.05}
{'loss': 0.0268, 'grad_norm': 0.34632906317710876, 'learning_rate': 7.1202471261170245e-06, 'epoch': 5.07}
{'loss': 0.0089, 'grad_norm': 0.5170720219612122, 'learning_rate': 6.855402987319348e-06, 'epoch': 5.09}
{'loss': 0.0178, 'grad_norm': 0.4268277585506439, 'learning_rate': 6.595215338606397e-06, 'epoch': 5.1}
{'loss': 0.0132, 'grad_norm': 0.7124648094177246, 'learning_rate': 6.339712261390213e-06, 'epoch': 5.12}
{'loss': 0.0173, 'grad_norm': 0.5214135050773621, 'learning_rate': 6.088921331488568e-06, 'epoch': 5.14}
{'loss': 0.0064, 'grad_norm': 0.3924752473831177, 'learning_rate': 5.8428696161488215e-06, 'epoch': 5.16}
{'loss': 0.021, 'grad_norm': 0.33278706669807434, 'learning_rate': 5.601583671126531e-06, 'epoch': 5.18}
{'loss': 0.0508, 'grad_norm': 1.2323592901229858, 'learning_rate': 5.365089537819434e-06, 'epoch': 5.19}
{'loss': 0.0196, 'grad_norm': 0.3533659875392914, 'learning_rate': 5.133412740456806e-06, 'epoch': 5.21}
{'loss': 0.0109, 'grad_norm': 0.837640643119812, 'learning_rate': 4.906578283344759e-06, 'epoch': 5.23}
{'loss': 0.0257, 'grad_norm': 0.5542824268341064, 'learning_rate': 4.684610648167503e-06, 'epoch': 5.25}
0%| | 0/46 [00:00> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1470
[INFO|configuration_utils.py:733] 2024-08-13 14:34:52,371 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 14:34:52,371 >> Model config Qwen2Config { ... identical to the Qwen2Config dump printed above ... }
[INFO|tokenization_utils_base.py:2702] 2024-08-13 14:34:52,976 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1470/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 14:34:52,977 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1470/special_tokens_map.json
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
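Taken together, the evaluation summaries in this section show the eval loss drifting upward (about 1.83 at step 910 versus 2.28 at step 1400) while the training loss keeps shrinking, which is worth bearing in mind when choosing a checkpoint. A convenience snippet with the values copied from the log; the step numbers are inferred from the 280-steps-per-epoch bookkeeping worked out above:

    # step -> eval_loss, from the evaluation summaries printed above
    eval_loss = {
        910: 1.8390, 980: 1.8253, 1050: 1.8688, 1120: 1.8910,
        1190: 2.2224, 1260: 2.3093, 1330: 2.2221, 1400: 2.2804,
    }
    best_step = min(eval_loss, key=eval_loss.get)
    print(best_step, eval_loss[best_step])   # -> 980 1.8253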
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined] 88%|████████▊ | 1471/1680 [4:58:36<1:01:26, 17.64s/it] 88%|████████▊ | 1472/1680 [4:58:48<55:06, 15.90s/it] 88%|████████▊ | 1473/1680 [4:59:00<51:01, 14.79s/it] 88%|████████▊ | 1474/1680 [4:59:12<47:31, 13.84s/it] 88%|████████▊ | 1475/1680 [4:59:24<45:35, 13.34s/it] 88%|████████▊ | 1475/1680 [4:59:24<45:35, 13.34s/it] 88%|████████▊ | 1476/1680 [4:59:36<44:11, 13.00s/it] 88%|████████▊ | 1477/1680 [4:59:48<42:40, 12.61s/it] 88%|████████▊ | 1478/1680 [5:00:00<41:40, 12.38s/it] 88%|████████▊ | 1479/1680 [5:00:11<39:56, 11.92s/it] 88%|████████▊ | 1480/1680 [5:00:23<40:19, 12.10s/it] 88%|████████▊ | 1480/1680 [5:00:23<40:19, 12.10s/it] 88%|████████▊ | 1481/1680 [5:00:35<39:45, 11.99s/it] 88%|████████▊ | 1482/1680 [5:00:46<38:40, 11.72s/it] 88%|████████▊ | 1483/1680 [5:00:58<39:03, 11.90s/it] 88%|████████▊ | 1484/1680 [5:01:10<38:07, 11.67s/it] 88%|████████▊ | 1485/1680 [5:01:21<37:56, 11.67s/it] 88%|████████▊ | 1485/1680 [5:01:21<37:56, 11.67s/it] 88%|████████▊ | 1486/1680 [5:01:33<38:16, 11.84s/it] 89%|████████▊ | 1487/1680 [5:01:46<38:21, 11.93s/it] 89%|████████▊ | 1488/1680 [5:01:57<37:32, 11.73s/it] 89%|████████▊ | 1489/1680 [5:02:09<37:17, 11.71s/it] 89%|████████▊ | 1490/1680 [5:02:22<38:49, 12.26s/it] 89%|████████▊ | 1490/1680 [5:02:22<38:49, 12.26s/it] 89%|████████▉ | 1491/1680 [5:02:33<37:13, 11.82s/it] 89%|████████▉ | 1492/1680 [5:02:45<37:04, 11.83s/it] 89%|████████▉ | 1493/1680 [5:02:56<36:41, 11.77s/it] 89%|████████▉ | 1494/1680 [5:03:08<36:35, 11.80s/it] 89%|████████▉ | 1495/1680 [5:03:20<35:55, 11.65s/it] 89%|████████▉ | 1495/1680 [5:03:20<35:55, 11.65s/it] 89%|████████▉ | 1496/1680 [5:03:32<36:25, 11.88s/it] 89%|████████▉ | 1497/1680 [5:03:44<36:16, 11.89s/it] 89%|████████▉ | 1498/1680 [5:03:55<35:39, 11.75s/it] 89%|████████▉ | 1499/1680 [5:04:06<34:50, 11.55s/it] 89%|████████▉ | 1500/1680 [5:04:19<35:30, 11.84s/it] 89%|████████▉ | 1500/1680 [5:04:19<35:30, 11.84s/it] 89%|████████▉ | 1501/1680 [5:04:31<35:25, 11.88s/it] 89%|████████▉ | 1502/1680 [5:04:43<35:10, 11.86s/it] 89%|████████▉ | 1503/1680 [5:04:54<34:34, 11.72s/it] 90%|████████▉ | 1504/1680 [5:05:06<34:32, 11.78s/it] 90%|████████▉ | 1505/1680 [5:05:18<34:31, 11.84s/it] 90%|████████▉ | 1505/1680 [5:05:18<34:31, 11.84s/it] 90%|████████▉ | 1506/1680 [5:05:30<34:11, 11.79s/it] 90%|████████▉ | 1507/1680 [5:05:42<34:06, 11.83s/it] 90%|████████▉ | 1508/1680 [5:05:53<33:59, 11.86s/it] 90%|████████▉ | 1509/1680 [5:06:05<33:40, 11.82s/it] 90%|████████▉ | 1510/1680 [5:06:18<34:00, 12.00s/it] 90%|████████▉ | 1510/1680 [5:06:18<34:00, 12.00s/it] 90%|████████▉ | 1511/1680 [5:06:29<33:17, 11.82s/it] 90%|█████████ | 1512/1680 [5:06:40<32:41, 11.68s/it] 90%|█████████ | 1513/1680 [5:06:52<32:35, 11.71s/it] 90%|█████████ | 1514/1680 [5:07:03<31:50, 11.51s/it] 90%|█████████ | 1515/1680 [5:07:15<32:05, 11.67s/it] 90%|█████████ | 1515/1680 [5:07:15<32:05, 11.67s/it] 90%|█████████ | 1516/1680 [5:07:26<31:28, 11.51s/it] 90%|█████████ | 1517/1680 [5:07:39<31:48, 11.71s/it] 90%|█████████ | 1518/1680 [5:07:50<31:29, 11.66s/it] 90%|█████████ | 1519/1680 [5:08:02<31:21, 11.69s/it] 90%|█████████ | 1520/1680 [5:08:14<31:18, 11.74s/it] 90%|█████████ | 1520/1680 [5:08:14<31:18, 11.74s/it] 91%|█████████ | 1521/1680 [5:08:25<30:55, 11.67s/it] 91%|█████████ | 1522/1680 [5:08:38<31:25, 11.93s/it] 91%|█████████ | 1523/1680 [5:08:49<30:31, 11.67s/it] 91%|█████████ | 1524/1680 [5:09:01<30:29, 11.73s/it] 91%|█████████ | 1525/1680 
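The FutureWarning emitted during checkpointing above (and repeated at each later checkpoint save) is raised inside PyTorch's own activation-checkpointing code, torch/utils/checkpoint.py, not by the training script; the fix it suggests is the device-agnostic torch.amp.autocast API. A minimal sketch of the two spellings, with a tiny stand-in model and tensor that are only illustrative:

    import torch

    model = torch.nn.Linear(16, 16)   # illustrative stand-in for any module
    x = torch.randn(4, 16)

    # Deprecated spelling that triggers the FutureWarning in the log:
    #   with torch.cpu.amp.autocast(dtype=torch.bfloat16): ...
    # Recommended, device-agnostic spelling:
    with torch.amp.autocast("cpu", dtype=torch.bfloat16):
        y = model(x)

    print(y.dtype)  # torch.bfloat16 under CPU autocast

Nothing in the training script needs to change for this warning; it only signals that the spelling used inside the installed PyTorch build will change in a future release.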
[training progress: 1525/1680 → 1540/1680, 5:09:12 → 5:12:12 elapsed, ≈11.4–12.4 s/it]
[INFO|trainer.py:3819] 2024-08-13 14:48:40,055 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 14:48:40,055 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 14:48:40,055 >> Batch size = 1
{'eval_loss': 2.559340238571167, 'eval_runtime': 17.7441, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 5.25}
{'loss': 0.0159, 'grad_norm': 1.8598344326019287, 'learning_rate': 4.467533791345191e-06, 'epoch': 5.27}
{'loss': 0.0117, 'grad_norm': 0.721049427986145, 'learning_rate': 4.255371141448272e-06, 'epoch': 5.28}
{'loss': 0.0092, 'grad_norm': 0.5821325778961182, 'learning_rate': 4.048145596668967e-06, 'epoch': 5.3}
{'loss': 0.0214, 'grad_norm': 1.502493977546692, 'learning_rate': 3.84587952234991e-06, 'epoch': 5.32}
{'loss': 0.0087, 'grad_norm': 2.6551506519317627, 'learning_rate': 3.6485947485702832e-06, 'epoch': 5.34}
{'loss': 0.0193, 'grad_norm': 0.7094867825508118, 'learning_rate': 3.4563125677897932e-06, 'epoch': 5.35}
{'loss': 0.0181, 'grad_norm': 0.48095235228538513, 'learning_rate': 3.269053732550581e-06, 'epoch': 5.37}
{'loss': 0.0184, 'grad_norm': 1.0630472898483276, 'learning_rate': 3.086838453237506e-06, 'epoch': 5.39}
{'loss': 0.0083, 'grad_norm': 1.2398452758789062, 'learning_rate': 2.9096863958968268e-06, 'epoch': 5.41}
{'loss': 0.0129, 'grad_norm': 1.2862337827682495, 'learning_rate': 2.737616680113758e-06, 'epoch': 5.43}
{'loss': 0.0176, 'grad_norm': 1.1500790119171143, 'learning_rate': 2.570647876948895e-06, 'epoch': 5.44}
{'loss': 0.026, 'grad_norm': 1.017544150352478, 'learning_rate': 2.408798006933882e-06, 'epoch': 5.46}
{'loss': 0.0495, 'grad_norm': 0.36417996883392334, 'learning_rate': 2.252084538126542e-06, 'epoch': 5.48}
{'loss': 0.0249, 'grad_norm': 0.6736142039299011, 'learning_rate': 2.100524384225555e-06, 'epoch': 5.5}
[evaluation progress: 0/46 → 46/46]
[INFO|trainer.py:3503] >> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1540
[INFO|configuration_utils.py:733] 2024-08-13 14:48:58,387 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 14:48:58,388 >> Model config Qwen2Config (identical to the dump shown at checkpoint-1470)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 14:48:58,633 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1540/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 14:48:58,634 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1540/special_tokens_map.json
[FutureWarning repeated: `torch.cpu.amp.autocast(args...)` is deprecated; use `torch.amp.autocast('cpu', args...)` instead]
[training progress: 1541/1680 → 1580/1680, 5:12:42 → 5:20:31 elapsed, ≈11.7–17.5 s/it (slower immediately after checkpointing)]
[training progress: 1581/1680 → 1610/1680, 5:20:44 → 5:26:32 elapsed, ≈11.8–12.3 s/it]
[INFO|trainer.py:3819] 2024-08-13 15:03:00,077 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 15:03:00,077 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 15:03:00,078 >> Batch size = 1
{'eval_loss': 2.6220109462738037, 'eval_runtime': 17.7422, 'eval_samples_per_second': 2.593, 'eval_steps_per_second': 2.593, 'epoch': 5.5}
{'loss': 0.0197, 'grad_norm': 0.5455936193466187, 'learning_rate': 1.9541339027450256e-06, 'epoch': 5.52}
{'loss': 0.0154, 'grad_norm': 1.3337368965148926, 'learning_rate': 1.8129288932490274e-06, 'epoch': 5.53}
{'loss': 0.0192, 'grad_norm': 0.9104143381118774, 'learning_rate': 1.6769245956464396e-06, 'epoch': 5.55}
{'loss': 0.0271, 'grad_norm': 0.9945054650306702, 'learning_rate': 1.5461356885461075e-06, 'epoch': 5.57}
{'loss': 0.0128, 'grad_norm': 1.1372507810592651, 'learning_rate': 1.4205762876726092e-06, 'epoch': 5.59}
{'loss': 0.0167, 'grad_norm': 0.292233943939209, 'learning_rate': 1.3002599443428243e-06, 'epoch': 5.6}
{'loss': 0.0196, 'grad_norm': 0.8667420148849487, 'learning_rate': 1.1851996440033319e-06, 'epoch': 5.62}
{'loss': 0.0141, 'grad_norm': 0.6354473233222961, 'learning_rate': 1.0754078048289374e-06, 'epoch': 5.64}
{'loss': 0.0289, 'grad_norm': 1.5247339010238647, 'learning_rate': 9.708962763824048e-07, 'epoch': 5.66}
{'loss': 0.0161, 'grad_norm': 1.2465256452560425, 'learning_rate': 8.716763383355864e-07, 'epoch': 5.68}
{'loss': 0.0133, 'grad_norm': 1.1474500894546509, 'learning_rate': 7.777586992519959e-07, 'epoch': 5.69}
{'loss': 0.0272, 'grad_norm': 1.2113944292068481, 'learning_rate': 6.891534954310885e-07, 'epoch': 5.71}
{'loss': 0.0084, 'grad_norm': 0.8237090706825256, 'learning_rate': 6.058702898142643e-07, 'epoch': 5.73}
{'loss': 0.0238, 'grad_norm': 1.4685379266738892, 'learning_rate': 5.279180709527765e-07, 'epoch': 5.75}
[evaluation progress: 0/46 → 46/46]
[INFO|trainer.py:3503] >> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1610
[INFO|configuration_utils.py:733] 2024-08-13 15:03:18,391 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 15:03:18,392 >> Model config Qwen2Config (identical to the dump shown at checkpoint-1470)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 15:03:18,625 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1610/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 15:03:18,626 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1610/special_tokens_map.json
[FutureWarning repeated: `torch.cpu.amp.autocast(args...)` is deprecated; use `torch.amp.autocast('cpu', args...)` instead]
[training progress: 1611/1680 → 1664/1680, 5:27:04 → 5:37:31 elapsed, ≈11.5–18.1 s/it (slower immediately after checkpointing)]
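The learning-rate values logged alongside the training losses above decay smoothly, from 7.94e-06 at epoch 5.02 down to 5.28e-07 at epoch 5.75, and they reach 0.0 at the very last step further below. That shape is consistent with a cosine decay schedule, but the scheduler type, peak learning rate, and warmup settings are not part of this excerpt, so the following is only a sketch under those assumptions:

    import torch
    from torch.optim import AdamW
    from transformers import get_cosine_schedule_with_warmup

    # Assumed values for illustration; the run's real peak LR and warmup are not in this log.
    peak_lr, total_steps, warmup_steps = 1e-4, 1680, 0

    params = [torch.nn.Parameter(torch.zeros(1))]
    optimizer = AdamW(params, lr=peak_lr)
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

    for _ in range(total_steps):
        optimizer.step()
        scheduler.step()

    print(scheduler.get_last_lr())  # [0.0] at step 1680, matching the final logged 'learning_rate': 0.0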
[training progress: 1665/1680 → 1680/1680, 5:37:42 → 5:40:42 elapsed, ≈11.6–12.3 s/it]
[INFO|trainer.py:3819] 2024-08-13 15:17:10,667 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 15:17:10,667 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 15:17:10,667 >> Batch size = 1
{'eval_loss': 2.618908643722534, 'eval_runtime': 17.7484, 'eval_samples_per_second': 2.592, 'eval_steps_per_second': 2.592, 'epoch': 5.75}
{'loss': 0.0246, 'grad_norm': 1.3691818714141846, 'learning_rate': 4.553052520375911e-07, 'epoch': 5.77}
{'loss': 0.0098, 'grad_norm': 0.16103225946426392, 'learning_rate': 3.8803966999139684e-07, 'epoch': 5.78}
{'loss': 0.0179, 'grad_norm': 1.2005606889724731, 'learning_rate': 3.261285846227868e-07, 'epoch': 5.8}
{'loss': 0.0139, 'grad_norm': 0.3046216070652008, 'learning_rate': 2.6957867784270787e-07, 'epoch': 5.82}
{'loss': 0.0117, 'grad_norm': 0.48873192071914673, 'learning_rate': 2.1839605294330933e-07, 'epoch': 5.84}
{'loss': 0.0191, 'grad_norm': 0.46313759684562683, 'learning_rate': 1.725862339392259e-07, 'epoch': 5.85}
{'loss': 0.0147, 'grad_norm': 0.34420424699783325, 'learning_rate': 1.3215416497138754e-07, 'epoch': 5.87}
{'loss': 0.014, 'grad_norm': 0.7899921536445618, 'learning_rate': 9.710420977340762e-08, 'epoch': 5.89}
{'loss': 0.0146, 'grad_norm': 0.9719728827476501, 'learning_rate': 6.744015120061509e-08, 'epoch': 5.91}
{'loss': 0.0223, 'grad_norm': 0.30600279569625854, 'learning_rate': 4.316519082179227e-08, 'epoch': 5.93}
{'loss': 0.0112, 'grad_norm': 0.7548239231109619, 'learning_rate': 2.4281948573617874e-08, 'epoch': 5.94}
{'loss': 0.0273, 'grad_norm': 0.9639114141464233, 'learning_rate': 1.0792462477909882e-08, 'epoch': 5.96}
{'loss': 0.0282, 'grad_norm': 2.499755382537842, 'learning_rate': 2.6981884216847884e-09, 'epoch': 5.98}
{'loss': 0.0262, 'grad_norm': 2.1217262744903564, 'learning_rate': 0.0, 'epoch': 6.0}
[evaluation progress: 0/46 → 46/46]
[INFO|trainer.py:3503] >> Saving model checkpoint to saves/Qwen2-72B-Instruct/checkpoint-1680
[INFO|configuration_utils.py:733] 2024-08-13 15:17:29,154 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 15:17:29,154 >> Model config Qwen2Config (identical to the dump shown at checkpoint-1470)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 15:17:30,417 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/checkpoint-1680/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 15:17:30,418 >> Special tokens file saved in saves/Qwen2-72B-Instruct/checkpoint-1680/special_tokens_map.json
[INFO|trainer.py:2394] 2024-08-13 15:17:32,279 >> Training completed. Do not forget to share your model on huggingface.co/models =)
100%|██████████| 1680/1680 [5:41:04<00:00, 12.18s/it]
[INFO|trainer.py:3503] 2024-08-13 15:17:32,282 >> Saving model checkpoint to saves/Qwen2-72B-Instruct
[INFO|configuration_utils.py:733] 2024-08-13 15:17:32,811 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--Qwen--Qwen2-72B-Instruct/snapshots/1af63c698f59c4235668ec9c1395468cb7cd7e79/config.json
[INFO|configuration_utils.py:800] 2024-08-13 15:17:32,811 >> Model config Qwen2Config (identical to the dump shown at checkpoint-1470)
[INFO|tokenization_utils_base.py:2702] 2024-08-13 15:17:34,158 >> tokenizer config file saved in saves/Qwen2-72B-Instruct/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-13 15:17:34,158 >> Special tokens file saved in saves/Qwen2-72B-Instruct/special_tokens_map.json
{'eval_loss': 2.630276679992676, 'eval_runtime': 17.7504, 'eval_samples_per_second': 2.591, 'eval_steps_per_second': 2.591, 'epoch': 6.0}
{'train_runtime': 20472.2071, 'train_samples_per_second': 1.314, 'train_steps_per_second': 0.082, 'train_loss': 0.5969825589208908, 'epoch': 6.0}
***** train metrics *****
  epoch                    =       5.9973
  total_flos               = 1647300902GF
  train_loss               =        0.597
  train_runtime            =   5:41:12.20
  train_samples_per_second =        1.314
  train_steps_per_second   =        0.082
Figure saved at: saves/Qwen2-72B-Instruct/training_loss.png
Figure saved at: saves/Qwen2-72B-Instruct/training_eval_loss.png
08/13/2024 15:17:34 - WARNING - llamafactory.extras.ploting - No metric eval_accuracy to plot.
[INFO|trainer.py:3819] 2024-08-13 15:17:34,736 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-13 15:17:34,737 >> Num examples = 46
[INFO|trainer.py:3824] 2024-08-13 15:17:34,737 >> Batch size = 1
[evaluation progress: 0/46 → 46/46]
Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
***** eval metrics *****
  epoch                   =     5.9973
  eval_loss               =     2.6303
  eval_runtime            = 0:00:17.63
  eval_samples_per_second =      2.609
  eval_steps_per_second   =      2.609
wandb: 0.115 MB of 0.115 MB uploaded
wandb:
wandb: Run history:
wandb: eval/loss ▂▁▁▁▁▁▁▁▂▂▂▂▄▄▄▄▆▆▆▆█████
wandb: eval/runtime █▇██▇▇▇█▇▇▇▇▇█▇▇██▇▇▇▇██▁
wandb: eval/samples_per_second ▁▁▁▁▂▁▂▁▂▂▁▂▂▁▂▂▁▁▁▁▁▂▁▁█
wandb: eval/steps_per_second ▁▁▁▁▂▁▂▁▂▂▁▂▂▁▂▂▁▁▁▁▁▂▁▁█
wandb: train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: train/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: train/grad_norm █▂▂▂▂▂▂▃▄▄▄▄▃▅▆▇▆▅▆▆█▇▇▇▆▇▆▅█▃▄▅▇▁▁▂▂▂▁▅
wandb: train/learning_rate ▂▄▅▇██████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁
wandb: train/loss █▄▄▄▄▄▄▄▄▄▃▃▄▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb: eval/loss 2.63028
wandb: eval/runtime 17.6344
wandb: eval/samples_per_second 2.609
wandb: eval/steps_per_second 2.609
wandb: total_flos 1.7687758754493235e+18
wandb: train/epoch 5.99732
wandb: train/global_step 1680
wandb: train/grad_norm 2.12173
wandb: train/learning_rate 0.0
wandb: train/loss 0.0262
wandb: train_loss 0.59698
wandb: train_runtime 20472.2071
wandb: train_samples_per_second 1.314
wandb: train_steps_per_second 0.082
wandb:
wandb: 🚀 View run Qwen2-72B-Instruct_lora_sft at: https://wandb.ai/inflaton-ai/huggingface/runs/8lhczcch
wandb: ⭐️ View project at: https://wandb.ai/inflaton-ai/huggingface
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240813_093621-8lhczcch/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Current Directory: /common/home/users/d/dh.huang.2023/code/rapget-translation
Evaluating Qwen/Qwen2-72B-Instruct
[nltk_data] Downloading package wordnet to /common/home/users/d/dh.huang.2023/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /common/home/users/d/dh.huang.2023/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /common/home/users/d/dh.huang.2023/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
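Before moving to the evaluation output, the throughput figures in the train metrics reported above can be checked against one another. The implied number of samples per optimizer step (the effective batch size) is an inference; the actual per-device batch size and gradient-accumulation settings do not appear in this excerpt:

    # Numbers taken from the train metrics block above.
    train_runtime_s = 20472.2071    # 'train_runtime' (5:41:12.20)
    total_steps = 1680              # final global_step
    samples_per_second = 1.314      # 'train_samples_per_second'

    steps_per_second = total_steps / train_runtime_s
    print(round(steps_per_second, 3))   # 0.082, matching 'train_steps_per_second'

    samples_per_step = samples_per_second / steps_per_second
    print(round(samples_per_step))      # 16, the implied effective batch size per optimizer step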
loading env vars from: /common/home/users/d/dh.huang.2023/common2/code/rapget-translation/.env
working dir: /common/home/users/d/dh.huang.2023/common2/code/rapget-translation
adding /common/home/users/d/dh.huang.2023/common2/code/rapget-translation to sys.path
loading: /common/home/users/d/dh.huang.2023/common2/code/rapget-translation/eval_modules/calc_repetitions.py
loading /common/home/users/d/dh.huang.2023/common2/code/rapget-translation/llm_toolkit/translation_utils.py
Qwen/Qwen2-72B-Instruct llama-factory/saves/Qwen2-72B-Instruct True 1 results/mac-results_fine_tuned.csv
CUDA is available, we have found 1 GPU(s)
NVIDIA H100 PCIe
CUDA version: 12.1
Evaluating model: Qwen/Qwen2-72B-Instruct on cuda
(0) GPU = NVIDIA H100 PCIe. Max memory = 79.097 GB.
0.0 GB of memory reserved.
loading model: Qwen/Qwen2-72B-Instruct with adapter: None
[Loading checkpoint shards: 0/37 → 37/37]
<|im_start|>system
You are a helpful assistant that translates Chinese to English.<|im_end|>
<|im_start|>user
You will be given a Chinese sentence to translate. If it is an incomplete sentence, or if you are unsure about the meaning, simply copy the input text as your output. Do not output any additional sentence such as explanation or reasoning.
Chinese: 老耿端起枪,眯缝起一只三角眼,一搂扳机响了枪,冰雹般的金麻雀劈哩啪啦往下落,铁砂子在柳枝间飞迸着,嚓嚓有声。
English:<|im_end|>
<|im_start|>assistant
Old Geng picked up his shotgun, squinted, and pulled the trigger. Two sparrows crashed to the ground like hailstones as shotgun pellets tore noisily through the branches.<|im_end|>
--------------------------------------------------
prompt: <|im_start|>system
You are a helpful assistant that translates Chinese to English.<|im_end|>
<|im_start|>user
You will be given a Chinese sentence to translate. If it is an incomplete sentence, or if you are unsure about the meaning, simply copy the input text as your output. Do not output any additional sentence such as explanation or reasoning.
Chinese: 老耿端起枪,眯缝起一只三角眼,一搂扳机响了枪,冰雹般的金麻雀劈哩啪啦往下落,铁砂子在柳枝间飞迸着,嚓嚓有声。
English:<|im_end|>
<|im_start|>assistant
(1) GPU = NVIDIA H100 PCIe. Max memory = 79.097 GB.
43.477 GB of memory reserved.
found 24 checkpoints: ['checkpoint-70', 'checkpoint-140', 'checkpoint-210', 'checkpoint-280', 'checkpoint-350', 'checkpoint-420', 'checkpoint-490', 'checkpoint-560', 'checkpoint-630', 'checkpoint-700', 'checkpoint-770', 'checkpoint-840', 'checkpoint-910', 'checkpoint-980', 'checkpoint-1050', 'checkpoint-1120', 'checkpoint-1190', 'checkpoint-1260', 'checkpoint-1330', 'checkpoint-1400', 'checkpoint-1470', 'checkpoint-1540', 'checkpoint-1610', 'checkpoint-1680']
Running from epoch 1 to 24
Epoch 1
loading adapter: llama-factory/saves/Qwen2-72B-Instruct/checkpoint-70
0%| | 0/1133 [00:00
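Each checkpoint-* directory listed above is a LoRA adapter saved by LLaMA-Factory on top of the frozen Qwen2-72B-Instruct base weights, and the dumped prompts follow Qwen2's ChatML format. The sketch below shows how such an adapter could be attached and queried with transformers and peft; it is an assumption about the mechanics, not the actual code in llm_toolkit, and the quantization and generation settings are illustrative (4-bit loading is assumed only because ~43 GB reserved on a single 80 GB H100 is far below what a bf16 72B model would need):

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    base_id = "Qwen/Qwen2-72B-Instruct"
    adapter_path = "llama-factory/saves/Qwen2-72B-Instruct/checkpoint-70"  # first of the 24 checkpoints

    tokenizer = AutoTokenizer.from_pretrained(base_id)

    # Assumed 4-bit loading; one way to fit a 72B base model into the logged memory footprint.
    quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=quant_cfg, device_map="auto")
    model = PeftModel.from_pretrained(model, adapter_path)  # attach the LoRA adapter

    messages = [
        {"role": "system", "content": "You are a helpful assistant that translates Chinese to English."},
        {"role": "user", "content": "You will be given a Chinese sentence to translate. ... Chinese: ... English:"},  # abbreviated; full text in the prompt dump above
    ]
    # apply_chat_template reproduces the <|im_start|>...<|im_end|> structure shown in the log.
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))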