Lmod has detected the following error:  The following module(s) are unknown: "CUDA/12.1.1"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "CUDA/12.1.1"

Also make sure that all modulefiles written in TCL start with the string #%Module

Submitting job: /common/home/users/d/dh.huang.2023/code/logical-reasoning/scripts/tune-mgtv.sh
Current Directory: /common/home/users/d/dh.huang.2023/code/logical-reasoning
Sat Jul 13 10:40:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:01:00.0 Off |                    0 |
| N/A   42C    P0             51W /  350W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Linux holiday 4.18.0-553.5.1.el8_10.x86_64 #1 SMP Thu Jun 6 09:41:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
NAME="Rocky Linux"
VERSION="8.10 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               17
Model name:          AMD EPYC 9554 64-Core Processor
Stepping:            1
CPU MHz:             3100.000
CPU max MHz:         3762.9880
CPU min MHz:         1500.0000
BogoMIPS:            6190.80
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
MemTotal:            527521124 kB

Tuning with config/internlm2_5_7b_lora_sft_bf16_p1_full.yaml
Current Directory: /common/home/users/d/dh.huang.2023/code/logical-reasoning/llama-factory
config/internlm2_5_7b_lora_sft_bf16_p1_full.yaml:
{
  "model_name_or_path": "internlm/internlm2_5-7b-chat-1m",
  "stage": "sft",
  "do_train": true,
  "finetuning_type": "lora",
  "lora_target": "all",
  "loraplus_lr_ratio": 16.0,
  "upcast_layernorm": true,
  "dataset": "alpaca_mgtv_p1",
  "template": "intern2",
  "cutoff_len": 4096,
  "max_samples": 25000,
  "overwrite_cache": true,
  "preprocessing_num_workers": 16,
  "output_dir": "saves/internlm2_5_7b/lora/sft_bf16_p1_full",
  "logging_steps": 10,
  "save_steps": 44,
  "plot_loss": true,
  "overwrite_output_dir": true,
  "per_device_train_batch_size": 64,
  "gradient_accumulation_steps": 8,
  "learning_rate": 0.0001,
  "num_train_epochs": 6.0,
  "lr_scheduler_type": "cosine",
  "warmup_ratio": 0.1,
  "bf16": true,
  "ddp_timeout": 180000000,
  "val_size": 0.1,
  "per_device_eval_batch_size": 1,
  "eval_strategy": "steps",
  "eval_steps": 44,
  "report_to": "wandb",
  "run_name": "internlm2_5_7b_p1_h100"
}
07/13/2024 10:40:09 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2108] 2024-07-13 10:40:10,014 >> loading file ./tokenizer.model from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/./tokenizer.model
[INFO|tokenization_utils_base.py:2108] 2024-07-13 10:40:10,014 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-07-13 10:40:10,014 >> loading file special_tokens_map.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-07-13 10:40:10,014 >> loading file tokenizer_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/tokenizer_config.json
[INFO|tokenization_utils_base.py:2108] 2024-07-13 10:40:10,014 >> loading file tokenizer.json from cache at None
07/13/2024 10:40:11 - INFO - llamafactory.data.template - Add <|im_end|> to stop words.
07/13/2024 10:40:11 - INFO - llamafactory.data.loader - Loading dataset alpaca_mgtv_p1.json...
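A quick sanity check of the step counts the trainer reports below (a back-of-envelope sketch added for annotation; not output from the run): with val_size=0.1, the 25,000 samples split into 22,500 train / 2,500 eval, and the effective batch on one GPU is per_device_train_batch_size x gradient_accumulation_steps.

    import math

    train_examples = int(25_000 * (1 - 0.1))   # val_size=0.1 -> 22,500 train / 2,500 eval
    effective_batch = 64 * 8 * 1               # batch 64 x grad-accum 8 x 1 GPU = 512
    steps_per_epoch = math.ceil(train_examples / effective_batch)   # 44
    print(steps_per_epoch, steps_per_epoch * 6)                     # 44 264

So eval_steps=44 and save_steps=44 amount to exactly one evaluation and one checkpoint per epoch, and 6 epochs give the 264 optimization steps logged below.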
Converting format of dataset (num_proc=16):   0%|          | 0/25000 [00:00<?, ? examples/s]
[INFO|configuration_utils.py:733] 2024-07-13 10:40:15,862 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/config.json
[INFO|configuration_utils.py:796] 2024-07-13 10:40:15,863 >> Model config InternLM2Config {
  "_name_or_path": "internlm/internlm2_5-7b-chat-1m",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "internlm/internlm2_5-7b-chat-1m--configuration_internlm2.InternLM2Config",
    "AutoModel": "internlm/internlm2_5-7b-chat-1m--modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "internlm/internlm2_5-7b-chat-1m--modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 262144,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.5,
    "type": "dynamic"
  },
  "rope_theta": 50000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "vocab_size": 92544
}
[INFO|modeling_utils.py:3474] 2024-07-13 10:40:16,158 >> loading weights file model.safetensors from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-07-13 10:40:16,159 >> Instantiating InternLM2ForCausalLM model under default dtype torch.bfloat16.
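For reference, this checkpoint can also be loaded directly with transformers; a minimal sketch (not from the run), assuming the same bf16 dtype. trust_remote_code is required because the auto_map entries above resolve to InternLM's own modeling code:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "internlm/internlm2_5-7b-chat-1m"
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # matches the run's compute dtype
        trust_remote_code=True,      # resolves auto_map to InternLM2ForCausalLM
    )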
[INFO|configuration_utils.py:962] 2024-07-13 10:40:16,160 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}
input_ids: [1, 92543, 1008, 364, 60403, 68625, 70503, 68309, 69323, 60687, 60364, 60355, 68309, 69776, 68411, 60387, ..., 92542, 364, 92543, 525, 11353, 364, 68278, 2]   (middle of the token-id dump trimmed)
inputs: <|im_start|>user
你是一个逻辑游戏的主持人。游戏规则如下:

1. 参与者会得到一个谜题。
2. 参与者可以通过提问来获取线索,尝试解开谜题。
3. 对于每个问题,主持人将根据实际情况回答以下五个选项之一:是、不是、不重要、回答正确、问法错误。
4. 回答中不能添加任何其它信息,也不能省略选项中的任何一个字。例如,不可以把“不是”省略成“不”。
5. 参与者需要根据回答来推理,并最终找出谜题的正确答案。

请严格按照这些规则回答参与者提出的问题。

谜题: 在甄家村里,有一个古老的传说:每年南瓜丰收的季节,南瓜田里总有一个最大的南瓜会不翼而飞,村民们对此现象困惑不解。请找出南瓜失踪背后的原因。

实际情况: 真相原来与一位年迈的农夫有关。这位农夫年轻时,曾与一位美丽的姑娘相恋。他们约定在南瓜丰收的季节结婚。然而,命运弄人,姑娘在婚礼前的一场意外中离世。悲伤的农夫为了纪念心爱的姑娘,每年都会将最大的南瓜偷走,放到姑娘的墓前,以此寄托自己的哀思。这一行为延续了多年,成为了乡村里一个神秘的传说。

参与者提出的问题: 偷的人信神吗
<|im_end|>
<|im_start|>assistant
不是
label_ids: [-100, -100, -100, ..., -100, 68278, 2]   (one -100 per prompt position; the run is trimmed here)
labels: 不是
Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
All model checkpoint weights were used when initializing InternLM2ForCausalLM.
[INFO|modeling_utils.py:4288] 2024-07-13 10:40:24,414 >> All the weights of InternLM2ForCausalLM were initialized from the model checkpoint at internlm/internlm2_5-7b-chat-1m.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternLM2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:917] 2024-07-13 10:40:24,657 >> loading configuration file generation_config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/generation_config.json
[INFO|configuration_utils.py:962] 2024-07-13 10:40:24,657 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": [
    2,
    92542
  ],
  "pad_token_id": 2
}
07/13/2024 10:40:24 - INFO - llamafactory.model.model_utils.checkpointing - Upcasting layernorm weights in float32.
07/13/2024 10:40:24 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
07/13/2024 10:40:24 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
07/13/2024 10:40:24 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
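The label_ids dump above shows how the SFT loss is masked: every prompt position carries -100, the ignore_index of torch.nn.CrossEntropyLoss, so only the response tokens, 68278 ("不是", i.e. "No") plus the EOS id 2, contribute to the loss. A schematic sketch of that masking (my own illustration with a hypothetical prompt length, not LLaMA-Factory's actual code):

    IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

    def build_labels(prompt_ids: list[int], response_ids: list[int]) -> list[int]:
        # Mask every prompt position so the loss is computed on the response only.
        return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)

    labels = build_labels(prompt_ids=[0] * 330, response_ids=[68278, 2])
    assert labels[:330] == [IGNORE_INDEX] * 330 and labels[-2:] == [68278, 2]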
07/13/2024 10:40:24 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
07/13/2024 10:40:24 - INFO - llamafactory.model.model_utils.misc - Found linear modules: wqkv,w1,w3,wo,w2
07/13/2024 10:40:24 - INFO - llamafactory.model.loader - trainable params: 18,874,368 || all params: 7,756,582,912 || trainable%: 0.2433
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:641] 2024-07-13 10:40:24,975 >> Using auto half precision backend
07/13/2024 10:40:25 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
[INFO|trainer.py:2078] 2024-07-13 10:40:25,194 >> ***** Running training *****
[INFO|trainer.py:2079] 2024-07-13 10:40:25,194 >>   Num examples = 22,500
[INFO|trainer.py:2080] 2024-07-13 10:40:25,194 >>   Num Epochs = 6
[INFO|trainer.py:2081] 2024-07-13 10:40:25,194 >>   Instantaneous batch size per device = 64
[INFO|trainer.py:2084] 2024-07-13 10:40:25,195 >>   Total train batch size (w. parallel, distributed & accumulation) = 512
[INFO|trainer.py:2085] 2024-07-13 10:40:25,195 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2086] 2024-07-13 10:40:25,195 >>   Total optimization steps = 264
[INFO|trainer.py:2087] 2024-07-13 10:40:25,196 >>   Number of trainable parameters = 18,874,368
[INFO|integration_utils.py:723] 2024-07-13 10:40:25,198 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: inflaton-sg (inflaton-ai). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.4
wandb: Run data is saved locally in /common2/dh.huang.2023/code/logical-reasoning/llama-factory/wandb/run-20240713_104026-5gfamdui
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run internlm2_5_7b_p1_h100
wandb: ⭐️ View project at https://wandb.ai/inflaton-ai/huggingface
wandb: 🚀 View run at https://wandb.ai/inflaton-ai/huggingface/runs/5gfamdui
  0%|          | 0/264 [00:00<?, ?it/s]
***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-13 11:18:32,961 >>   Num examples = 2500
[INFO|trainer.py:3724] 2024-07-13 11:18:32,962 >>   Batch size = 1
{'loss': 5.5717, 'grad_norm': 3.057248115539551, 'learning_rate': 3.7037037037037037e-05, 'epoch': 0.23}
{'loss': 0.3958, 'grad_norm': 0.5911761522293091, 'learning_rate': 7.407407407407407e-05, 'epoch': 0.45}
{'loss': 0.3145, 'grad_norm': 0.49433770775794983, 'learning_rate': 9.996046986136509e-05, 'epoch': 0.68}
{'loss': 0.2876, 'grad_norm': 0.4638547897338867, 'learning_rate': 9.925944931706173e-05, 'epoch': 0.91}
  0%|          | 0/2500 [00:00<?, ?it/s]
Saving model checkpoint to saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-44
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-13 11:20:46,132 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/config.json
[INFO|configuration_utils.py:796] 2024-07-13 11:20:46,133 >> Model config InternLM2Config { ... }   (dump identical to the model config shown above; omitted)
[INFO|tokenization_utils_base.py:2513] 2024-07-13 11:20:46,359 >> tokenizer config file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-44/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-13 11:20:46,360 >> Special tokens file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-44/special_tokens_map.json
 17%|█▋        | 45/264 [41:06<5:34:03, 91.52s/it]
  ... (per-step progress, ~52 s/it, trimmed) ...
 33%|███▎      | 88/264 [1:18:10<2:29:49, 51.07s/it]
[INFO|trainer.py:3719] 2024-07-13 11:58:42,733 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-13 11:58:42,734 >>   Num examples = 2500
[INFO|trainer.py:3724] 2024-07-13 11:58:42,734 >>   Batch size = 1
{'eval_loss': 0.26385805010795593, 'eval_accuracy': 0.8948, 'eval_runtime': 132.5476, 'eval_samples_per_second': 18.861, 'eval_steps_per_second': 18.861, 'epoch': 1.0}
{'loss': 0.265, 'grad_norm': 1.1651757955551147, 'learning_rate': 9.769414454563615e-05, 'epoch': 1.14}
{'loss': 0.26, 'grad_norm': 0.592210054397583, 'learning_rate': 9.529201968327616e-05, 'epoch': 1.36}
{'loss': 0.249, 'grad_norm': 0.5416140556335449, 'learning_rate': 9.209522133654969e-05, 'epoch': 1.59}
{'loss': 0.244, 'grad_norm': 0.39804431796073914, 'learning_rate': 8.815983909692943e-05, 'epoch': 1.82}
  0%|          | 0/2500 [00:00<?, ?it/s]
Saving model checkpoint to saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-88
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-13 12:00:58,171 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/config.json
[INFO|configuration_utils.py:796] 2024-07-13 12:00:58,172 >> Model config InternLM2Config { ... }   (identical dump omitted)
[INFO|tokenization_utils_base.py:2513] 2024-07-13 12:00:58,348 >> tokenizer config file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-88/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-13 12:00:58,349 >> Special tokens file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-88/special_tokens_map.json
 34%|███▎      | 89/264 [1:21:18<4:28:52, 92.18s/it]
  ... (per-step progress, ~52 s/it, trimmed) ...
 50%|█████     | 132/264 [1:58:20<1:52:40, 51.22s/it]
[INFO|trainer.py:3719] 2024-07-13 12:38:53,184 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-13 12:38:53,184 >>   Num examples = 2500
[INFO|trainer.py:3724] 2024-07-13 12:38:53,184 >>   Batch size = 1
{'eval_loss': 0.2507841885089874, 'eval_accuracy': 0.8997333333333332, 'eval_runtime': 134.7467, 'eval_samples_per_second': 18.553, 'eval_steps_per_second': 18.553, 'epoch': 2.0}
{'loss': 0.2435, 'grad_norm': 0.27605682611465454, 'learning_rate': 8.355492141795185e-05, 'epoch': 2.05}
{'loss': 0.1952, 'grad_norm': 0.3273526728153229, 'learning_rate': 7.83612641219884e-05, 'epoch': 2.27}
{'loss': 0.2059, 'grad_norm': 0.3215419054031372, 'learning_rate': 7.26699927929466e-05, 'epoch': 2.5}
{'loss': 0.2138, 'grad_norm': 0.3129217326641083, 'learning_rate': 6.65809639276034e-05, 'epoch': 2.73}
{'loss': 0.2054, 'grad_norm': 0.4610234498977661, 'learning_rate': 6.020101289825324e-05, 'epoch': 2.95}
  0%|          | 0/2500 [00:00<?, ?it/s]
Saving model checkpoint to saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-132
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-13 12:41:06,154 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/config.json
[INFO|configuration_utils.py:796] 2024-07-13 12:41:06,155 >> Model config InternLM2Config { ... }   (identical dump omitted)
[INFO|tokenization_utils_base.py:2513] 2024-07-13 12:41:06,333 >> tokenizer config file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-132/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-13 12:41:06,334 >> Special tokens file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-132/special_tokens_map.json
 50%|█████     | 133/264 [2:01:25<3:19:32, 91.40s/it]
  ... (per-step progress, ~52 s/it, trimmed) ...
 67%|██████▋   | 176/264 [2:38:27<1:14:26, 50.76s/it]
[INFO|trainer.py:3719] 2024-07-13 13:18:59,828 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-07-13 13:18:59,828 >>   Num examples = 2500
[INFO|trainer.py:3724] 2024-07-13 13:18:59,828 >>   Batch size = 1
{'eval_loss': 0.25276708602905273, 'eval_accuracy': 0.9028666666666668, 'eval_runtime': 132.4222, 'eval_samples_per_second': 18.879, 'eval_steps_per_second': 18.879, 'epoch': 3.0}
{'loss': 0.1583, 'grad_norm': 0.3851621150970459, 'learning_rate': 5.364207946713318e-05, 'epoch': 3.18}
{'loss': 0.1516, 'grad_norm': 0.4073261618614197, 'learning_rate': 4.701924374150901e-05, 'epoch': 3.41}
{'loss': 0.1467, 'grad_norm': 0.3168102204799652, 'learning_rate': 4.044870702967461e-05, 'epoch': 3.64}
{'loss': 0.1532, 'grad_norm': 0.5770123600959778, 'learning_rate': 3.404575302486039e-05, 'epoch': 3.86}
  0%|          | 0/2500 [00:00<?, ?it/s]
Saving model checkpoint to saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-176
/common/home/users/d/dh.huang.2023/.conda/envs/llm-perf-bench/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[INFO|configuration_utils.py:733] 2024-07-13 13:21:12,771 >> loading configuration file config.json from cache at /common/scratch/users/d/dh.huang.2023/transformers/hub/models--internlm--internlm2_5-7b-chat-1m/snapshots/8d1a709a04d71440ef3df6ebbe204672f411c8b6/config.json
[INFO|configuration_utils.py:796] 2024-07-13 13:21:12,771 >> Model config InternLM2Config { ... }   (identical dump omitted)
[INFO|tokenization_utils_base.py:2513] 2024-07-13 13:21:12,945 >> tokenizer config file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-176/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-07-13 13:21:12,946 >> Special tokens file saved in saves/internlm2_5_7b/lora/sft_bf16_p1_full/checkpoint-176/special_tokens_map.json
 67%|██████▋   | 177/264 [2:41:32<2:12:05, 91.10s/it]
  ... (per-step progress, ~52 s/it, trimmed) ...
 72%|███████▏  | 189/264 [2:51:59<1:06:00, 52.81s/it]
srun: Job step aborted: Waiting up to 17 seconds for job step to finish.
 72%|███████▏  | 190/264 [2:52:51<1:04:53, 52.62s/it]
  ... (per-step progress, ~52 s/it, trimmed) ...
 75%|███████▌  | 199/264 [3:00:36<55:39, 51.38s/it]
slurmstepd: error: *** STEP 71094.0 ON holiday CANCELLED AT 2024-07-13T13:41:34 DUE TO PREEMPTION ***
slurmstepd: error: *** JOB 71094 ON holiday CANCELLED AT 2024-07-13T13:41:34 DUE TO PREEMPTION ***

Job ID: 71094
Cluster: crimson
User/Group: dh.huang.2023/dh.huang.2023
State: PREEMPTED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:01
CPU Efficiency: 0.00% of 1-06:16:10 core-walltime
Job Wall-clock time: 03:01:37
Memory Utilized: 2.04 GB
Memory Efficiency: 0.80% of 256.00 GB
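Because checkpoints were written every 44 steps, the preemption at step 199 loses only the work after checkpoint-176 (roughly 20 minutes at ~52 s/step). A sketch for restarting without redoing the first three hours, assuming the job is resubmitted with the same output_dir; get_last_checkpoint is the stock transformers helper, and wiring its result in as resume_from_checkpoint mirrors the HF Trainer argument of that name:

    from transformers.trainer_utils import get_last_checkpoint

    output_dir = "saves/internlm2_5_7b/lora/sft_bf16_p1_full"
    last_ckpt = get_last_checkpoint(output_dir)  # e.g. ".../checkpoint-176", or None
    print(last_ckpt)

Passing that path as resume_from_checkpoint in the tuning YAML (and submitting with requeue-friendly SLURM settings) would let a resubmitted job resume from step 176 instead of step 0.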