[2024-07-15 08:07:05,974] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-15 08:08:29,643] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-15 08:08:29,660] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-15 08:08:29,667] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-15 08:08:29,671] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-15 08:08:55,588] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-15 08:08:55,588] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-15 08:08:55,591] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-15 08:08:55,602] [INFO] [comm.py:637:init_distributed] cdb=None
2024-07-15 08:08:55 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/share/home/models/CHT-Mistral-7B/CHT-6/', model_revision='main', model_code_revision=None, torch_dtype=None, tokenizer_name_or_path=None, trust_remote_code=False, use_flash_attention_2=False, use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8')
2024-07-15 08:08:55 - INFO - __main__ - Data parameters DataArguments(chat_template=None, text_column='text', preprocessing_num_workers=12, truncation_side=None, auto_insert_empty_system_msg=True, train_file='data/training_ptt_dpo-include_id-converted.jsonl', validation_file='data/testing_ptt_dpo_include_id-converted.jsonl', cache_dir='cache_dir')
2024-07-15 08:08:55 - INFO - __main__ - Training/evaluation parameters DPOConfig(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
beta=0.01,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
dataset_num_proc=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_dropout=True,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=100,
eval_strategy=steps,
evaluation_strategy=steps,
f_alpha_divergence_coef=1.0,
f_divergence_type=FDivergenceType.REVERSE_KL,
force_use_ref_model=False,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generate_during_eval=False,
gradient_accumulation_steps=2,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_model_revision=main,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
is_encoder_decoder=None,
jit_mode_eval=False,
label_names=None,
label_pad_token_id=-100,
label_smoothing=0,
label_smoothing_factor=0.0,
learning_rate=5e-07,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=results/DPO/output_dir/runs/Jul15_08-08-55_CKIP-LLM-VM1,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
loss_type=sigmoid,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_length=1024,
max_prompt_length=512,
max_steps=-1,
max_target_length=None,
metric_for_best_model=None,
model_adapter_name=None,
model_init_kwargs=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=DPO/CHT-6_20240715-1605,
overwrite_output_dir=False,
padding_value=None,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
precompute_ref_log_probs=False,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
ref_adapter_name=None,
ref_model_init_kwargs=None,
ref_model_mixup_alpha=0.9,
ref_model_sync_steps=64,
reference_free=False,
remove_unused_columns=False,
report_to=['tensorboard', 'wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
rpo_alpha=None,
run_name=None,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=100,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=None,
sync_ref_model=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
truncation_mode=keep_end,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'messages'],
        num_rows: 1728
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'messages'],
        num_rows: 91
    })
})
2024-07-15 08:08:57 - INFO - __main__ - Decontaminated 0 (0.00%) samples from the training set.
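For orientation, the hyperparameters above map onto a short trl training script. Below is a minimal sketch, assuming a trl version that exposes DPOConfig (as logged); the model path and data files are taken from the log, while the dataset loading and script structure are illustrative rather than the actual run_dpo.py:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Paths copied from the logged ModelArguments / DataArguments.
model_name = "/share/home/models/CHT-Mistral-7B/CHT-6/"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("json", data_files={
    "train": "data/training_ptt_dpo-include_id-converted.jsonl",
    "test": "data/testing_ptt_dpo_include_id-converted.jsonl",
})

# Values copied from the logged DPOConfig; everything else left at defaults.
args = DPOConfig(
    output_dir="DPO/CHT-6_20240715-1605",
    beta=0.01,
    loss_type="sigmoid",
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # 8 micro x 2 accum x 4 GPUs = batch 64
    gradient_checkpointing=True,
    bf16=True,
    max_length=1024,
    max_prompt_length=512,
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    save_total_limit=1,
    seed=42,
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # with precompute_ref_log_probs=False, trl builds the frozen reference model itself
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()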
2024-07-15 08:08:57 - INFO - __main__ - Prompt sample 1309 of the raw training set:
Human: 文章:輿情作者,日期,輿情內文
fspt01,6/20,1.近期晚上4G行動網路常常中斷,從110/05/28開始,特別開始記錄4G網路品質。 我也不是沒有先跟中華電信客服人員反應遇到的障礙,但畢竟是跟固定設備投資 有關,客服人員、機房人員只能敷衍了事。 2.一直以來,很少關心公民議題,這次被搞到一肚子火,認真思考有那些機構,可以 讓我們老百姓反應中華電信壟斷性議題。 網路上看到這則, http://www.ftc.gov.tw/uploadDecision/dd3d922b-bc37-4911-8141-b42779405ab4.pdf 我是看不太懂,大概了解「公平交易委員會」處分「中華電信股份有限公司」新台幣 500萬元罰鍰。深深感到,這或許是一條路。 3.因為我不懂很多東西,希望各個專業領域版友能提供一些資訊。
ebbf,06/20 23:53,有人壓著你只能選種花?
bcd21086999,06/21 00:22,噓
cv17,06/21 00:24,???
obuibo2000,06/21 00:26,任何室內網路不良的,我只推固網.
obuibo2000,06/21 00:26,要賭便宜有便宜的作法,地點好或是挑業者。
ufoubjlbotp,06/21 00:36,人家罰固網廣告不實,與壟斷何干
kbto4560,06/21 00:38,中華電信問題:使用者問題
kbto4560,06/21 00:38,XXX電信問題:誰叫你要用XXX
djuzqpsu,06/21 00:41,光世代的廣告跟4G訊號有啥關係???
gvlbp21,06/21 01:02,呵...為何不攜碼出去?
qbvm40807,06/21 01:10,真的爛要求無條件解約就好了 台灣有5間電信 比中國
qbvm40807,06/21 01:10,還多欸 更不用說背後也都是獨立的大財團 說壟斷? 你
qbvm40807,06/21 01:10,是認真的嗎???
lojhiu326,06/21 01:32,扯到壟斷就太過了XD 當其他電信塑膠?
lplpllp416,06/21 01:38,去申請別家試用看看,哪家情況好就跳哪家,你這種
lplpllp416,06/21 01:38,情況下為何執著於同一家…
sa759,06/21 01:50,壟...壟斷4G?
TvohIzvo,06/21 02:04,推!權益是爭取來的!
Bovunjbp,06/21 02:46,不是湊足關鍵字就能遮掩你是想公審中華電信的奧客
zffpvp,06/21 03:02,中華4G真的爛 標5G亂喊價 把4G用戶拖下水
bsjbeof,06/21 04:39,也許你家是其他業者黃金地段 快跳槽吧 這沒壟斷 XD
bsjbeof,06/21 04:43,種花也許在你家地段受到其他抗議而放棄但不構成壟斷
nbocpx77,06/21 06:26,你先翻開合約好好看一下 就不會想去浪費這個時間了
Epsffo1010,06/21 06:55,.........
ay27771531,06/21 07:20,所以你要解釋那裡壟斷了嗎
Gn4o,06/21 07:28,……
ex7931425,06/21 07:58,中華電信壟斷?
h6i630,06/21 09:26,你不是只有中華可選,何來壟斷=.=?
hbnfljoh,06/21 10:19,不懂的東西你可以查清楚了在上來發文 顆顆
tpwb0809,06/21 11:17,測智商的發文
SD8377,06/21 11:50,什麼啦
NfohYjbo,06/21 11:56,中華4g哪有壟斷,原po是不是搞錯什麼了
kvtujo332805,06/21 12:07,過了超過12小時沒回應
kvtujo332805,06/21 12:07,…感覺又是一篇射後不理文了
ol950357,06/21 12:09,???????????壟斷在那裡???
tpoza700,06/21 12:19,?
cfopbjdp,06/21 12:33,台灣鯛好吃
bqqmfxbsn,06/21 14:19,呃呃呃,該從哪裡開始吐槽
ibolthp2012,06/21 15:39,這叫公民議題?
nfqbmjoh,06/21 16:12,請先理解甚麼是行動網路跟固定網路好嗎 別的先不說
nfqbmjoh,06/21 16:12,拿光世代的資料比行動網路???????
tijofMpwf,06/21 18:41,這是公民議題?
dbnfmmbmb,06/22 03:05,…
djnpof : 認真回,06/22 10:02,能歸責於種花的大概只有網路太多人使用以致
djnpof,06/22 10:05,網路中斷 其他很難要歸責種花 (干擾或訊號不佳等)
djnpof,06/22 10:08,曾有聽過有人因鄰居晚間使用電器而造成網路中斷的
djnpof,06/22 10:09,像這種狀況的 是種花的問題嗎?
bmfwfmnbo,06/22 11:25,疫情開始後,網路確實慢很多,共體時艱吧
===
問題:請對以上討論進行所有意見的分類與摘要,類別有六種: 建議、質疑、疑問、抱怨、讚美、提供資訊。意見摘要用條列的方式呈現,每條意見之後用括號包含持此意見的發言者,原本文章作者的意見也必須一併分析,同一條意見可能包含不只一人。
2024-07-15 08:08:57 - INFO - __main__ - Chosen sample 1309 of the raw training set:
Assistant: 建議:
- 建議可以換其他電信業者,並試試其他業者的服務品質如何 (ebbf,obuibo2000,lplpllp416,bsjbeof)
- 建議對方先翻開並理解合約內容,便不會浪費時間 (nbocpx77)
- 固定網路比較不會出現問題,不應該使用行動網路來取代 (obuibo2000)
- 如果真的覺得爛到無法忍受,可以請求無條件解約 (qbvm40807)
質疑:
- 中華電信的問題被視為用戶問題,其他電信的問題則是因為選擇該電信業者 (kbto4560)
- 中華電信4G和5G的問題和光世代的廣告無關 (djuzqpsu)
- 對於原po以壟斷來批評中華電信的用詞有疑問 (ufoubjlbotp, qbvm40807, lojhiu326, sa759, ex7931425,h6i630,NfohYjbo)
- 懷疑原po是個奧客,想要公審中華電信 (Bovunjbp)
- 懷疑原po沒有好好理解公平交易法和壟斷的意義 (hbnfljoh)
- 懷疑原po發完文後就沒有回應,可能是在故意搞事 (kvtujo332805)
疑問:
- 原po所說的壟斷是指什麼?為何認為中華電信壟斷了行動網路市場?(ay27771531, lojhiu326,NfohYjbo,ol950357,h6i630)
- 不理解原po為什麼堅持使用同一家信號不好的電信業者而不換家試試 (lplpllp416,Gn4o)
抱怨:
- 抱怨中華電信4G的不穩定,標5G價格卻又拖累4G用戶 (fspt01, zffpvp)
讚美:
- 原po對公民議題的態度值得肯定。 (TvohIzvo)
提供資訊:
- 台灣有五家電信業者,背後都有大財團支援,沒有壟斷的問題 (qbvm40807)
- 台灣的網路使用量在疫情期間增加,速度變慢是共體時艱 (bmfwfmnbo)
- 網路使用時段、設備種類、位置等因素都可能影響網路品質,並非完全由電信業者控制 (djnpof)
2024-07-15 08:08:57 - INFO - __main__ - Rejected sample 1309 of the raw training set:
Assistant: 建議:
1. 應該讓老百姓反應中華電信壟斷性議題 (fspt01)
2. 應該攜碼出去 (gvlbp21)
3. 應該去申請別家試用看看,哪家情況好就跳哪家 (lplpllp416)
4. 應該先翻開合約好好看一下,就不會想去浪費這個時間了 (nbocpx77)
質疑:
1. 質疑中華電信是否真的壟斷4G (sa759, a7931425, h6i630, hbnfljoh, ol950357, tpoza700, cfopbjdp, bqqmfxbsn, ibolthp2012, nfqbmjoh, tijofMpwf, dbnfmmbmb)
2. 質疑種花是否構成壟斷 (bsjbeof)
疑問:
1. 疑問公民議題是什麼 (fspt01, ibolthp2012, nfqbmjoh, tijofMpwf)
2. 疑問壟斷在哪裡 (ay27771531, Gn4o, ex7931425, ol950357, tpoza700, cfopbjdp, bqqmfxbsn, ibolthp2012, nfqbmjoh, tijofMpwf, dbnfmmbmb)
抱怨:
1. 抱怨4G網路品質不好 (fspt01)
2. 抱怨中華電信壟斷 (fspt01, kbto4560, zffpvp, bsjbeof)
3. 抱怨網路中斷 (nfqbmjoh)
讚美:
1. 讚美固網 (obuibo2000)
2. 讚美公平交易委員會處分中華電信的行為 (fspt01)
提供資訊:
1. 提供了公平交易委員會處分中華電信的資訊 (fspt01)
2. 提供了中華電信被罰款的資訊 (ebbf, cv17, obuibo2000, ufoubjlbotp, kbto4560, djuzqpsu, gvlbp21, qbvm40807, lojhiu326, lplpllp416, sa759, TvohIzvo, Bovunjbp, zffpvp, bsjbeof, nbocpx77, Epsffo1010, ay27771531, Gn4o, ex7931425, h6i630, hbnfljoh, ol950357, tpoza700, cfopbjdp, bqqmfxbsn, ibolthp2012, nfqbmjoh, tijofMpwf, dbnfmmbmb)
2024-07-15 08:08:57 - INFO - __main__ - Prompt sample 228 of the raw training set:
Human: 文章:輿情作者,日期,輿情內文
bcdtbfhvtb11,6/25,照上面說的1199以下等於是5G的量到降速吃到飽,1399開始才是真的5G吃到飽 現在4G吃到飽我家靠窗速度還有100M上下,晚上有時速度會慢一點不到10M 今天看到5G方案讓我猶豫要不要升級5G了 有2個問題 1.50G的額度用USB分享給電腦也算嗎?還是開Wifi熱點才是? 2.如果用1399是熱點超過50G連本身吃到飽也降速到10M?還是只有被分享的裝置會降速到10M?
qpqpqbm,06/25 11:22,熱點分享量到是鎖卡 只剩手機能用吧 目前看起來是
qpqpqbm,06/25 11:22,這樣玩 不夠這也是中華董事長有說目前涵蓋不過才有
qpqpqbm,06/25 11:22,限制 未來就不知道了
zjfstbo,06/25 11:31,限制一堆價錢這樣 以為消費者會買單嗎 天真
qpqpqbm,06/25 11:32,市場跟客戶需求有關 客戶有需求就要滿足XD
qpqpqbm,06/25 11:33,https://reurl.cc/L3zKMe 高資費1399以上用的爽就好
qpqpqbm,06/25 11:34,不夠用早該辦固網滿足辦公家庭需求...
uudnm,06/25 12:07,我也好奇,熱點流量用完,手機的吃到飽應該不會限速
uudnm,06/25 12:07,吧
qpqpqbm,06/25 12:11,手機沒影響 就是要擋大戶 分享流量夠一般基本用途
olit9412235,06/25 14:51,為什麼拿台北市跟東京比,因為都是大都市,比較對
olit9412235,06/25 14:51,象是 docomo 跟中華電信,也是兩邊電信龍頭,這是
olit9412235,06/25 14:51,親身體驗。這還是手機而已,如果是租分享器速度會
olit9412235,06/25 14:51,更快。
lbnjs,06/25 15:18,額度是全部,包含使用SIM卡網路的所有裝置。
gmjfstlz,06/25 21:14,容量哪有在分你開分享與否?有網路都是用5g跑阿
===
問題:請對以上討論進行所有意見的分類與摘要,類別有六種: 建議、質疑、疑問、抱怨、讚美、提供資訊。意見摘要用條列的方式呈現,每條意見之後用括號包含持此意見的發言者,原本文章作者的意見也必須一併分析,同一條意見可能包含不只一人。
2024-07-15 08:08:57 - INFO - __main__ - Chosen sample 228 of the raw training set:
Assistant: 建議:
1. 可以考慮升級5G(bcdtbfhvtb11)
2. 需求不夠建議可以辦固網滿足辦公家庭需求(qpqpqbm)
質疑:
1. 對於新的5G方案有所疑慮,質疑是否值得升級(bcdtbfhvtb11)
2. 質疑行動網路公司將價格定得過高,以及設置太多限制(zjfstbo)
疑問:
1. 50G的額度是否包含用USB分享給電腦的數據(bcdtbfhvtb11)
2. 如果用1399是熱點超過50G連本身吃到飽也降速到10M(bcdtbfhvtb11)
3. 對於熱點流量用完後,手機的吃到飽是否會限速(uudnm)
抱怨:
1. 對於5G資費計劃有許多限制,價格又偏高感到不滿(zjfstbo)
讚美:
1. 沒有讚美的意見
提供資訊:
1. 分享數據到其他裝置也會計算在內(lbnjs, gmjfstlz)
2. 若熱點流量用完,手機所使用的數據還是5G數據並不受限制(uudnm, qpqpqbm)
3. 對於市場需求關係的說明(qpqpqbm)
4. 台北市與東京大都市的比較(olit9412235)
5. 提供相關的網路連結(qpqpqbm)
2024-07-15 08:08:57 - INFO - __main__ - Rejected sample 228 of the raw training set:
Assistant: 建議:
1. 高資費1399以上用的爽就好 (qpqpqbm)
2. 不夠用早該辦固網滿足辦公家庭需求 (qpqpqbm)
質疑:
1. 限制一堆價錢這樣 以為消費者會買單嗎 天真 (zjfstbo)
疑問:
1. 1.50G的額度用USB分享給電腦也算嗎?還是開Wifi熱點才是? (bcdtbfhvtb11)
2. 如果用1399是熱點超過50G連本身吃到飽也降速到10M?還是只有被分享的裝置會降速到10M? (bcdtbfhvtb11)
抱怨:
1. 市場跟客戶需求有關 客戶有需求就要滿足XD (qpqpqbm)
讚美:
無
提供資訊:
1. 熱點分享量到是鎖卡 只剩手機能用吧 目前看起來是這樣玩 不夠這也是中華董事長有說目前涵蓋不過才有限制 未來就不知道了 (qpqpqbm)
2. 手機沒影響 就是要擋大戶 分享流量夠一般基本用途 (qpqpqbm)
3. 額度是全部,包含使用SIM卡網路的所有裝置 (lbnjs)
4. 容量哪有在分你開分享與否?有網路都是用5g跑阿 (gmjfstlz)
5. 為什麼拿台北市跟東京比,因為都是大都市,比較對象是 docomo 跟中華電信,也是兩邊電信龍頭,這是親身體驗 (olit9412235)
2024-07-15 08:08:57 - INFO - __main__ - Prompt sample 51 of the raw training set:
Human: 文章:輿情作者,日期,輿情內文
epodmjboh,4/22,大家好!! 小弟這次剛好符合中華的公教雙飽資格, 想衝一波, 但是只想辦個299的就好, 因為平常都是用預付卡買1G, 不過有看到299裡面, 有個優惠是群內優惠 免費, 想問這個 是要需要多加50加電話號碼進群組嗎? 還是就是299就可以直接設定電話進群組 對這個有點不懂,想問有無人可以解答 謝謝大大們
kbnft732,04/22 18:29,你要不要考慮直接問客服?
qvtiqvmm,04/22 18:34,好像不用加50。 剛好今天辦299有問。
yv3,04/22 18:44,台哥的群內免費試 每個單位有一個編號 例如台北市政
yv3,04/22 18:45,府教育局 編號是1234 那教育局的員工辦公教專案
===
問題:請對以上討論進行所有意見的分類與摘要,類別有六種: 建議、質疑、疑問、抱怨、讚美、提供資訊。意見摘要用條列的方式呈現,每條意見之後用括號包含持此意見的發言者,原本文章作者的意見也必須一併分析,同一條意見可能包含不只一人。
2024-07-15 08:08:57 - INFO - __main__ - Chosen sample 51 of the raw training set:
Assistant:
- 疑問:想要了解299元的方案是不是可以直接加入群組,還是要多加50元,並希望有人能提供資訊 (epodmjboh)
- 建議:直接詢問客服以獲得最確實資訊 (kbnft732)
- 提供資訊:今天辦理299元專案,表示不需要加50元 (qvtiqvmm)
- 提供資訊:解釋群內免費的方式,每個單位組會有一個編號 (yv3)
2024-07-15 08:08:57 - INFO - __main__ - Rejected sample 51 of the raw training set:
Assistant:
- 疑問:
  - 想問這個 是要需要多加50加電話號碼進群組嗎? 還是就是299就可以直接設定電話進群組 (epodmjboh)
  - 想問有無人可以解答 (epodmjboh)
- 建議:
  - 你要不要考慮直接問客服? (kbnft732)
- 提供資訊:
  - 好像不用加50。 剛好今天辦299有問。 (qvtiqvmm)
  - 台哥的群內免費試 每個單位有一個編號 例如台北市政府教育局 編號是1234 那教育局的員工辦公教專案 (yv3)
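Each logged record pairs one PTT prompt with a preferred (chosen) and a dispreferred (rejected) summary. For a quick look at the raw data, a sketch along these lines would work; the key names come from the logged dataset features and the path from DataArguments, while whether chosen/rejected are plain strings or message lists depends on the (unseen) conversion script, so the str() slicing below is deliberately defensive:

import json

# File path from the logged DataArguments; keys from the logged dataset features.
with open("data/training_ptt_dpo-include_id-converted.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(sorted(record))                 # expected: ['chosen', 'messages', 'prompt', 'rejected']
print(str(record["prompt"])[:120])    # PTT 討論串 + 分類/摘要指令
print(str(record["chosen"])[:120])    # preferred summary
print(str(record["rejected"])[:120])  # dispreferred summary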
[INFO 20240712-1747] init DPOTrainer
[2024-07-15 08:08:57,938] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-15 08:09:05,078] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 291, num_elems = 7.24B
[2024-07-15 08:12:06,088] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 582, num_elems = 14.48B
2024-07-15 08:12:09 - WARNING - datasets.fingerprint - Parameter 'function'=> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
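The fingerprint warning above (emitted once per rank) is benign for correctness but disables caching: the map function trl applies during tokenization cannot be pickled, so `datasets` assigns it a random fingerprint and re-runs the transform on every launch. For custom preprocessing, a module-level function with picklable arguments keeps the cache stable; a minimal, self-contained sketch:

from datasets import Dataset

def add_prompt_len(example):
    # Top-level and picklable, so datasets can fingerprint it deterministically.
    return {"prompt_len": len(example["prompt"])}

ds = Dataset.from_dict({"prompt": ["hello", "world!"]})
ds = ds.map(add_prompt_len)  # cached across runs; no fingerprint warning
print(ds[0])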
[2024-07-15 08:12:22,020] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.2, git-hash=unknown, git-branch=unknown
[2024-07-15 08:12:22,033] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-15 08:12:22,035] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-07-15 08:12:22,133] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-07-15 08:12:22,134] [INFO] [utils.py:803:see_memory_usage] MA 6.74 GB Max_MA 7.42 GB CA 7.89 GB Max_CA 8 GB
[2024-07-15 08:12:22,134] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 17.44 GB, percent = 2.0%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-07-15 08:12:22,238] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-07-15 08:12:22,238] [INFO] [utils.py:803:see_memory_usage] MA 6.74 GB Max_MA 6.74 GB CA 7.89 GB Max_CA 8 GB
[2024-07-15 08:12:22,238] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 17.47 GB, percent = 2.0%
[2024-07-15 08:12:22,239] [INFO] [config.py:972:print] DeepSpeedEngine configuration:
[2024-07-15 08:12:22,239] [INFO] [config.py:976:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2024-07-15 08:12:22,239] [INFO] [config.py:976:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-15 08:12:22,239] [INFO] [config.py:976:print] amp_enabled .................. False
[2024-07-15 08:12:22,239] [INFO] [config.py:976:print] amp_params ................... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] bfloat16_enabled ............. True
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] comms_config .................
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] communication_data_type ...... None
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] curriculum_params_legacy ..... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] data_efficiency_enabled ...... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] dataloader_drop_last ......... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] disable_allgather ............ False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] dump_state ................... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_enabled ........... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] eigenvalue_verbose ........... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] elasticity_enabled ........... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] fp16_auto_cast ............... None
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] fp16_enabled ................. False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] global_rank .................. 0
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] grad_accum_dtype ............. None
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] gradient_accumulation_steps .. 2
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] gradient_clipping ............ 1.0
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] load_universal_checkpoint .... False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] loss_scale ................... 1.0
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] memory_breakdown ............. False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] mics_hierarchial_params_gather False
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] mics_shard_size .............. -1
[2024-07-15 08:12:22,240] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] optimizer_name ............... None
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] optimizer_params ............. None
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] pld_enabled .................. False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] pld_params ................... False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] prescale_gradients ........... False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] scheduler_name ............... None
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] scheduler_params ............. None
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] sparse_attention ............. None
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] steps_per_print .............. inf
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] train_batch_size ............. 64
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 8
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] use_node_local_storage ....... False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] wall_clock_breakdown ......... False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] weight_quantization_config ... None
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] world_size ................... 4
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] zero_allow_untested_optimizer False
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] zero_enabled ................. True
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True
[2024-07-15 08:12:22,241] [INFO] [config.py:976:print] zero_optimization_stage ...... 3
[2024-07-15 08:12:22,241] [INFO] [config.py:962:print_user_config] json = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 2,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        },
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "zero_optimization.reduce_bucket_size": 1.677722e+07,
    "zero_optimization.stage3_param_persistence_threshold": 4.096000e+04,
    "zero_optimization.stage3_prefetch_bucket_size": 1.509949e+07
}
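The user json above is what the transformers/accelerate integration handed to DeepSpeed after resolving its "auto" fields (DPOConfig shows deepspeed=None, so the ZeRO-3 setup evidently came from an accelerate config rather than a trainer argument). The three dotted zero_optimization.* overrides match the standard auto values for a hidden_size of 4096 (Mistral-7B): reduce_bucket_size = 4096^2 = 16,777,216; stage3_param_persistence_threshold = 10 * 4096 = 40,960; stage3_prefetch_bucket_size = 0.9 * 4096^2 ≈ 15,099,494. A standalone ds_config dict reproducing the same settings might look like this sketch (illustrative, not the file actually used):

import math

ds_config = {
    "train_batch_size": 64,               # 8 micro-batch x 2 grad-accum x 4 ranks
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "steps_per_print": math.inf,
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "none"},
        "offload_param": {"device": "none"},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}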
[INFO 20240712-1747] ☆☆☆ [DEBUG 20240712-1804] checkpoint = None
[2024-07-15 08:12:47,906] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.2, git-hash=unknown, git-branch=unknown
[2024-07-15 08:12:47,913] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-15 08:12:47,914] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-07-15 08:12:47,914] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-07-15 08:12:47,924] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-07-15 08:12:47,924] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=
[2024-07-15 08:12:47,924] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-07-15 08:12:47,924] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-07-15 08:12:48,039] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2024-07-15 08:12:48,040] [INFO] [utils.py:803:see_memory_usage] MA 6.74 GB Max_MA 6.74 GB CA 6.94 GB Max_CA 8 GB
[2024-07-15 08:12:48,040] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.79 GB, percent = 2.2%
[2024-07-15 08:12:48,041] [INFO] [stage3.py:126:__init__] Reduce bucket size 500,000,000
[2024-07-15 08:12:48,042] [INFO] [stage3.py:127:__init__] Prefetch bucket size 50,000,000
[2024-07-15 08:12:48,153] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-07-15 08:12:48,154] [INFO] [utils.py:803:see_memory_usage] MA 6.74 GB Max_MA 6.74 GB CA 6.94 GB Max_CA 7 GB
[2024-07-15 08:12:48,154] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.79 GB, percent = 2.2%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-07-15 08:12:48,277] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-07-15 08:12:48,278] [INFO] [utils.py:803:see_memory_usage] MA 6.74 GB Max_MA 6.74 GB CA 6.94 GB Max_CA 7 GB
[2024-07-15 08:12:48,278] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.79 GB, percent = 2.2%
[2024-07-15 08:12:48,385] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2024-07-15 08:12:48,386] [INFO] [utils.py:803:see_memory_usage] MA 6.74 GB Max_MA 6.74 GB CA 6.94 GB Max_CA 7 GB
[2024-07-15 08:12:48,386] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.79 GB, percent = 2.2%
[2024-07-15 08:12:49,843] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 2
[2024-07-15 08:12:49,844] [INFO] [utils.py:803:see_memory_usage] MA 6.75 GB Max_MA 6.75 GB CA 7.28 GB Max_CA 7 GB
[2024-07-15 08:12:49,844] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.8 GB, percent = 2.2%
[2024-07-15 08:12:49,946] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2024-07-15 08:12:49,946] [INFO] [utils.py:803:see_memory_usage] MA 6.75 GB Max_MA 6.75 GB CA 7.28 GB Max_CA 7 GB
[2024-07-15 08:12:49,946] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.8 GB, percent = 2.2%
[2024-07-15 08:12:50,055] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2024-07-15 08:12:50,055] [INFO] [utils.py:803:see_memory_usage] MA 13.49 GB Max_MA 14.97 GB CA 15.92 GB Max_CA 16 GB
[2024-07-15 08:12:50,055] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.8 GB, percent = 2.2%
[2024-07-15 08:12:50,175] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2024-07-15 08:12:50,175] [INFO] [utils.py:803:see_memory_usage] MA 13.49 GB Max_MA 13.49 GB CA 15.92 GB Max_CA 16 GB
[2024-07-15 08:12:50,176] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.81 GB, percent = 2.2%
[2024-07-15 08:12:50,322] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2024-07-15 08:12:50,323] [INFO] [utils.py:803:see_memory_usage] MA 26.98 GB Max_MA 33.73 GB CA 36.97 GB Max_CA 37 GB
[2024-07-15 08:12:50,323] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.81 GB, percent = 2.2%
[2024-07-15 08:12:50,324] [INFO] [stage3.py:460:_setup_for_real_optimizer] optimizer state initialized
[2024-07-15 08:12:50,635] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2024-07-15 08:12:50,636] [INFO] [utils.py:803:see_memory_usage] MA 31.29 GB Max_MA 31.77 GB CA 51.01 GB Max_CA 51 GB
[2024-07-15 08:12:50,636] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 18.81 GB, percent = 2.2%
[2024-07-15 08:12:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-07-15 08:12:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-07-15 08:12:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-07-15 08:12:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
[2024-07-15 08:12:50,637] [INFO] [config.py:972:print] DeepSpeedEngine configuration:
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] amp_enabled .................. False
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] amp_params ................... False
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] bfloat16_enabled ............. True
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] comms_config .................
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] communication_data_type ...... None
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-15 08:12:50,637] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] curriculum_params_legacy ..... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] data_efficiency_enabled ...... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] dataloader_drop_last ......... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] disable_allgather ............ False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] dump_state ................... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_enabled ........... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] eigenvalue_verbose ........... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] elasticity_enabled ........... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] fp16_auto_cast ............... None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] fp16_enabled ................. False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] global_rank .................. 0
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] grad_accum_dtype ............. None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] gradient_accumulation_steps .. 2
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] gradient_clipping ............ 1.0
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] load_universal_checkpoint .... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] loss_scale ................... 1.0
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] memory_breakdown ............. False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] mics_hierarchial_params_gather False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] mics_shard_size .............. -1
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] optimizer_name ............... None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] optimizer_params ............. None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] pld_enabled .................. False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] pld_params ................... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] prescale_gradients ........... False
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] scheduler_name ............... None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] scheduler_params ............. None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] sparse_attention ............. None
[2024-07-15 08:12:50,638] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] steps_per_print .............. inf
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] train_batch_size ............. 64
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 8
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] use_node_local_storage ....... False
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] wall_clock_breakdown ......... False
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] weight_quantization_config ... None
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] world_size ................... 4
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] zero_allow_untested_optimizer True
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] zero_enabled ................. True
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True
[2024-07-15 08:12:50,639] [INFO] [config.py:976:print] zero_optimization_stage ...... 3
[2024-07-15 08:12:50,639] [INFO] [config.py:962:print_user_config] json = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 2,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        },
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true
}
[2024-07-15 08:13:21,659] [WARNING] [stage3.py:1949:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
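The allocator warning above fired only once in this run, so it is safe to ignore; were it recurring, the remedy the message itself suggests is a cache flush at the same step on every rank. A minimal sketch of what that looks like in a bare DeepSpeed loop (the engine/dataloader arguments are placeholders for your own objects, not part of this run's script):

from deepspeed.accelerator import get_accelerator

def train_epoch(engine, dataloader, flush_every=10):
    """engine: an initialized deepspeed engine; dataloader: any batch iterator."""
    for step, batch in enumerate(dataloader):
        loss = engine(batch)   # forward pass (assumes the model returns the loss)
        engine.backward(loss)
        engine.step()
        if step % flush_every == 0:
            # Flush the CUDA caching allocator on all ranks at the same step,
            # as the warning above recommends.
            get_accelerator().empty_cache()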
{'loss': 0.6931, 'grad_norm': 21.959526805466417, 'learning_rate': 1.6666666666666665e-07, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -430.01824951171875, 'logps/chosen': -522.2332763671875, 'logits/rejected': -1.5321521759033203, 'logits/chosen': -1.5251624584197998, 'epoch': 0.04}
{'loss': 0.6931, 'grad_norm': 20.292728976465323, 'learning_rate': 3.333333333333333e-07, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -407.6622314453125, 'logps/chosen': -442.01617431640625, 'logits/rejected': -1.5653201341629028, 'logits/chosen': -1.58231520652771, 'epoch': 0.07}
{'loss': 0.6925, 'grad_norm': 21.101947369321714, 'learning_rate': 5e-07, 'rewards/chosen': -0.001064786920323968, 'rewards/rejected': -0.0005854797782376409, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.000479307200293988, 'logps/rejected': -416.7240905761719, 'logps/chosen': -491.49005126953125, 'logits/rejected': -1.5687460899353027, 'logits/chosen': -1.5684146881103516, 'epoch': 0.11}
{'loss': 0.6911, 'grad_norm': 21.29780296562762, 'learning_rate': 4.978612153434526e-07, 'rewards/chosen': -0.0003022289602085948, 'rewards/rejected': -0.003410081844776869, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.003107852768152952, 'logps/rejected': -379.46832275390625, 'logps/chosen': -509.3629150390625, 'logits/rejected': -1.6125160455703735, 'logits/chosen': -1.5377585887908936, 'epoch': 0.15}
{'loss': 0.6852, 'grad_norm': 20.48824994961677, 'learning_rate': 4.91481456572267e-07, 'rewards/chosen': 0.001991434022784233, 'rewards/rejected': -0.012363090179860592, 'rewards/accuracies': 0.875, 'rewards/margins': 0.0143545251339674, 'logps/rejected': -403.9041748046875, 'logps/chosen': -488.50262451171875, 'logits/rejected': -1.5993618965148926, 'logits/chosen': -1.548508644104004, 'epoch': 0.19}
{'loss': 0.6695, 'grad_norm': 21.06721548722646, 'learning_rate': 4.809698831278217e-07, 'rewards/chosen': 0.0072067356668412685, 'rewards/rejected': -0.03501763194799423, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.04222436994314194, 'logps/rejected': -404.1998596191406, 'logps/chosen': -471.6202087402344, 'logits/rejected': -1.5432801246643066, 'logits/chosen': -1.4877359867095947, 'epoch': 0.22}
{'loss': 0.6585, 'grad_norm': 19.330099709889698, 'learning_rate': 4.6650635094610966e-07, 'rewards/chosen': 0.004273919388651848, 'rewards/rejected': -0.057606834918260574, 'rewards/accuracies': 1.0, 'rewards/margins': 0.06188075616955757, 'logps/rejected': -342.83990478515625, 'logps/chosen': -442.44384765625, 'logits/rejected': -1.5952730178833008, 'logits/chosen': -1.5653460025787354, 'epoch': 0.26}
{'loss': 0.6266, 'grad_norm': 20.582435449086052, 'learning_rate': 4.483383350728088e-07, 'rewards/chosen': 0.014045600779354572, 'rewards/rejected': -0.10960347950458527, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.12364907562732697, 'logps/rejected': -454.7938232421875, 'logps/chosen': -556.8931884765625, 'logits/rejected': -1.496585488319397, 'logits/chosen': -1.4436156749725342, 'epoch': 0.3}
{'loss': 0.6093, 'grad_norm': 19.02137727505308, 'learning_rate': 4.2677669529663686e-07, 'rewards/chosen': 0.0047954656183719635, 'rewards/rejected': -0.2064724564552307, 'rewards/accuracies': 1.0, 'rewards/margins': 0.21126791834831238, 'logps/rejected': -464.00732421875, 'logps/chosen': -545.1778564453125, 'logits/rejected': -1.5011253356933594, 'logits/chosen': -1.4762550592422485, 'epoch': 0.33}
{'loss': 0.5892, 'grad_norm': 18.29916228129018, 'learning_rate': 4.0219035725218013e-07, 'rewards/chosen': 0.006784859113395214, 'rewards/rejected': -0.2541770935058594, 'rewards/accuracies': 1.0, 'rewards/margins': 0.2609619200229645, 'logps/rejected': -449.64532470703125, 'logps/chosen': -547.4830322265625, 'logits/rejected': -1.5759844779968262, 'logits/chosen': -1.576920986175537, 'epoch': 0.37}
{'loss': 0.5794, 'grad_norm': 17.619368905828807, 'learning_rate': 3.75e-07, 'rewards/chosen': -0.055743396282196045, 'rewards/rejected': -0.24957191944122314, 'rewards/accuracies': 1.0, 'rewards/margins': 0.1938285231590271, 'logps/rejected': -381.874267578125, 'logps/chosen': -413.1097106933594, 'logits/rejected': -1.5630524158477783, 'logits/chosen': -1.5285149812698364, 'epoch': 0.41}
{'loss': 0.5319, 'grad_norm': 19.01708114625534, 'learning_rate': 3.4567085809127245e-07, 'rewards/chosen': -0.08890116214752197, 'rewards/rejected': -0.40977126359939575, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.3208701014518738, 'logps/rejected': -390.45758056640625, 'logps/chosen': -429.37237548828125, 'logits/rejected': -1.504837989807129, 'logits/chosen': -1.5181918144226074, 'epoch': 0.44}
{'loss': 0.4825, 'grad_norm': 19.102770950198305, 'learning_rate': 3.147047612756302e-07, 'rewards/chosen': -0.1315421462059021, 'rewards/rejected': -0.6144382953643799, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4828961193561554, 'logps/rejected': -425.3389892578125, 'logps/chosen': -497.2473449707031, 'logits/rejected': -1.4981393814086914, 'logits/chosen': -1.4961140155792236, 'epoch': 0.48}
{'loss': 0.4697, 'grad_norm': 19.301810836402147, 'learning_rate': 2.826315480550129e-07, 'rewards/chosen': -0.24572713673114777, 'rewards/rejected': -0.7290294170379639, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4833022952079773, 'logps/rejected': -439.08197021484375, 'logps/chosen': -486.4429016113281, 'logits/rejected': -1.4528899192810059, 'logits/chosen': -1.4483075141906738, 'epoch': 0.52}
{'loss': 0.4387, 'grad_norm': 18.71279520294138, 'learning_rate': 2.5e-07, 'rewards/chosen': -0.3652060627937317, 'rewards/rejected': -0.9309228658676147, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.5657168030738831, 'logps/rejected': -494.8599548339844, 'logps/chosen': -521.008544921875, 'logits/rejected': -1.5090162754058838, 'logits/chosen': -1.5292141437530518, 'epoch': 0.56}
{'loss': 0.4174, 'grad_norm': 18.793537566198992, 'learning_rate': 2.1736845194498716e-07, 'rewards/chosen': -0.5173332691192627, 'rewards/rejected': -1.037455439567566, 'rewards/accuracies': 0.875, 'rewards/margins': 0.5201221108436584, 'logps/rejected': -369.700927734375, 'logps/chosen': -459.7350158691406, 'logits/rejected': -1.4832723140716553, 'logits/chosen': -1.4446516036987305, 'epoch': 0.59}
{'loss': 0.4033, 'grad_norm': 16.745079618493, 'learning_rate': 1.8529523872436977e-07, 'rewards/chosen': -0.6878727078437805, 'rewards/rejected': -1.3357090950012207, 'rewards/accuracies': 1.0, 'rewards/margins': 0.6478363275527954, 'logps/rejected': -611.6802368164062, 'logps/chosen': -628.3623046875, 'logits/rejected': -1.4667479991912842, 'logits/chosen': -1.4263123273849487, 'epoch': 0.63}
{'loss': 0.4154, 'grad_norm': 16.635280686124094, 'learning_rate': 1.5432914190872756e-07, 'rewards/chosen': -0.7533223628997803, 'rewards/rejected': -1.3541579246520996, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6008355617523193, 'logps/rejected': -625.437255859375, 'logps/chosen': -662.1283569335938, 'logits/rejected': -1.4638080596923828, 'logits/chosen': -1.4674986600875854, 'epoch': 0.67}
{'loss': 0.3555, 'grad_norm': 20.71677350667776, 'learning_rate': 1.2500000000000005e-07, 'rewards/chosen': -0.7397876977920532, 'rewards/rejected': -1.7132179737091064, 'rewards/accuracies': 1.0, 'rewards/margins': 0.9734303951263428, 'logps/rejected': -571.26123046875, 'logps/chosen': -588.1524047851562, 'logits/rejected': -1.5262463092803955, 'logits/chosen': -1.522387981414795, 'epoch': 0.7}
{'loss': 0.3821, 'grad_norm': 15.4400444773846, 'learning_rate': 9.780964274781983e-08, 'rewards/chosen': -0.7622807025909424, 'rewards/rejected': -1.6570687294006348, 'rewards/accuracies': 1.0, 'rewards/margins': 0.8947880268096924, 'logps/rejected': -557.1793212890625, 'logps/chosen': -621.7742919921875, 'logits/rejected': -1.3520491123199463, 'logits/chosen': -1.3689525127410889, 'epoch': 0.74}
{'loss': 0.3535, 'grad_norm': 15.898172432048854, 'learning_rate': 7.322330470336313e-08, 'rewards/chosen': -0.9690157771110535, 'rewards/rejected': -1.7651585340499878, 'rewards/accuracies': 1.0, 'rewards/margins': 0.7961427569389343, 'logps/rejected': -677.109619140625, 'logps/chosen': -700.77294921875, 'logits/rejected': -1.4618651866912842, 'logits/chosen': -1.4822262525558472, 'epoch': 0.78}
{'loss': 0.3546, 'grad_norm': 15.739285942357705, 'learning_rate': 5.166166492719124e-08, 'rewards/chosen': -0.9611536264419556, 'rewards/rejected': -1.9216368198394775, 'rewards/accuracies': 1.0, 'rewards/margins': 0.960483193397522, 'logps/rejected': -537.1988525390625, 'logps/chosen': -577.4271240234375, 'logits/rejected': -1.4116517305374146, 'logits/chosen': -1.437910556793213, 'epoch': 0.81}
{'loss': 0.3301, 'grad_norm': 14.542054521548646, 'learning_rate': 3.349364905389032e-08, 'rewards/chosen': -0.7476925849914551, 'rewards/rejected': -2.160238265991211, 'rewards/accuracies': 1.0, 'rewards/margins': 1.4125456809997559, 'logps/rejected': -686.080322265625, 'logps/chosen': -638.280517578125, 'logits/rejected': -1.379716396331787, 'logits/chosen': -1.3872032165527344, 'epoch': 0.85}
{'loss': 0.3299, 'grad_norm': 17.78699135059973, 'learning_rate': 1.9030116872178314e-08, 'rewards/chosen': -0.8232169151306152, 'rewards/rejected': -1.8412718772888184, 'rewards/accuracies': 1.0, 'rewards/margins': 1.0180549621582031, 'logps/rejected': -591.6486206054688, 'logps/chosen': -697.1530151367188, 'logits/rejected': -1.4145452976226807, 'logits/chosen': -1.3574109077453613, 'epoch': 0.89}
{'loss': 0.3283, 'grad_norm': 15.532122481306784, 'learning_rate': 8.518543427732949e-09, 'rewards/chosen': -0.9585639834403992, 'rewards/rejected': -1.9474496841430664, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.9888858199119568, 'logps/rejected': -653.23974609375, 'logps/chosen': -600.0244140625, 'logits/rejected': -1.4114274978637695, 'logits/chosen': -1.4619381427764893, 'epoch': 0.93}
{'loss': 0.3018, 'grad_norm': 16.641777908742164, 'learning_rate': 2.1387846565474044e-09, 'rewards/chosen': -1.0871095657348633, 'rewards/rejected': -2.2172341346740723, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.1301246881484985, 'logps/rejected': -615.214599609375, 'logps/chosen': -623.5850830078125, 'logits/rejected': -1.3277167081832886, 'logits/chosen': -1.3785361051559448, 'epoch': 0.96}
{'loss': 0.3359, 'grad_norm': 13.468548124945492, 'learning_rate': 0.0, 'rewards/chosen': -0.978149950504303, 'rewards/rejected': -2.006209373474121, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.0280592441558838, 'logps/rejected': -536.807861328125, 'logps/chosen': -570.6109619140625, 'logits/rejected': -1.4331750869750977, 'logits/chosen': -1.451808214187622, 'epoch': 1.0}
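Reading the per-step dictionaries above: rewards/chosen and rewards/rejected are the beta-scaled log-probability ratios of the policy versus the reference model, rewards/margins is their difference, and with loss_type=sigmoid each pair contributes -log sigmoid(margin) to the loss. A quick check against the final logged step (the small gap comes from the trainer averaging per-example losses rather than margins):

import math

rewards_chosen, rewards_rejected = -0.978149950504303, -2.006209373474121  # final step
margin = rewards_chosen - rewards_rejected   # ~1.0281, the logged rewards/margins
loss_at_mean_margin = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(round(margin, 4), round(loss_at_mean_margin, 4))
# -> 1.0281 and ~0.306 vs the logged loss 0.3359; -log(sigmoid(.)) is convex,
# so the mean of per-example losses sits above the loss at the mean margin.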
[2024-07-15 08:21:03,617] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step27 is about to be saved!
[2024-07-15 08:21:03,645] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: DPO/CHT-6_20240715-1605/checkpoint-27/global_step27/zero_pp_rank_0_mp_rank_00_model_states.pt
[2024-07-15 08:21:03,645] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving DPO/CHT-6_20240715-1605/checkpoint-27/global_step27/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2024-07-15 08:21:03,733] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved DPO/CHT-6_20240715-1605/checkpoint-27/global_step27/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2024-07-15 08:21:03,834] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving DPO/CHT-6_20240715-1605/checkpoint-27/global_step27/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-07-15 08:26:30,541] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved DPO/CHT-6_20240715-1605/checkpoint-27/global_step27/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-07-15 08:26:30,762] [INFO] [engine.py:3393:_save_zero_checkpoint] zero checkpoint saved DPO/CHT-6_20240715-1605/checkpoint-27/global_step27/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-07-15 08:26:38,789] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step27 is ready now!
[INFO 20240712-1747] ☆☆☆☆☆
{'train_runtime': 828.5595, 'train_samples_per_second': 2.086, 'train_steps_per_second': 0.033, 'train_loss': 0.4969717639463919, 'epoch': 1.0}
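As a final sanity check, the summary numbers are internally consistent with the configuration; a short verification using only values that appear in this log:

train_rows = 1728                      # train split size from the DatasetDict
effective_batch = 8 * 2 * 4            # micro-batch x grad-accum x world_size = 64
steps = train_rows // effective_batch  # 27, matching checkpoint global_step27
runtime = 828.5595                     # train_runtime (seconds) from the summary
print(steps, round(train_rows / runtime, 3), round(steps / runtime, 3))
# -> 27 2.086 0.033, matching train_samples_per_second and train_steps_per_second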