[2024-09-25 23:47:38,760] torch.distributed.run: [WARNING] [2024-09-25 23:47:38,760] torch.distributed.run: [WARNING] ***************************************** [2024-09-25 23:47:38,760] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. [2024-09-25 23:47:38,760] torch.distributed.run: [WARNING] ***************************************** 2024-09-25 23:47:47.146026: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 2024-09-25 23:47:47.273366: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 2024-09-25 23:47:47.343966: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 2024-09-25 23:47:47.347688: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 2024-09-25 23:47:47.370794: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 2024-09-25 23:47:47.394489: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 2024-09-25 23:47:47.417703: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 2024-09-25 23:47:47.439844: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT [2024-09-25 23:47:49,558] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-09-25 23:47:49,693] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-09-25 23:47:49,726] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-09-25 23:47:49,913] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) [2024-09-25 23:47:50,021] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [2024-09-25 23:47:50,088] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-09-25 23:47:50,121] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-09-25 23:47:50,144] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [2024-09-25 23:47:50,297] [INFO] [comm.py:637:init_distributed] cdb=None [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [2024-09-25 23:47:50,525] [INFO] [comm.py:637:init_distributed] cdb=None [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [2024-09-25 23:47:50,547] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-25 23:47:50,547] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [I 240925 23:47:50 run_dpo:1025] attn type: flash_attention_2 [2024-09-25 23:47:51,026] [INFO] [comm.py:637:init_distributed] cdb=None You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [2024-09-25 23:47:51,313] [INFO] [comm.py:637:init_distributed] cdb=None /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn( [2024-09-25 23:47:51,360] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-25 23:47:51,488] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-25 23:47:51,503] [INFO] [comm.py:637:init_distributed] cdb=None You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/4 [00:00 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147613 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147613:2147613 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). bolt-fd37yj2rd5-5z3nm97yap:2147613:2147613 [0] NCCL INFO cudaDriverVersion 12020 NCCL version 2.18.6+cuda12.1 /mnt/task_runtime/lmm/dpo/trl/trainer/dpo_trainer.py:307: UserWarning: `max_prompt_length` is not set in the DPOTrainer's init it will default to `128` by default, but you should do it yourself in the future. warnings.warn( /mnt/task_runtime/lmm/dpo/trl/trainer/dpo_trainer.py:307: UserWarning: `max_prompt_length` is not set in the DPOTrainer's init it will default to `128` by default, but you should do it yourself in the future. warnings.warn( /mnt/task_runtime/lmm/dpo/trl/trainer/dpo_trainer.py:307: UserWarning: `max_prompt_length` is not set in the DPOTrainer's init it will default to `128` by default, but you should do it yourself in the future. warnings.warn( /mnt/task_runtime/lmm/dpo/trl/trainer/dpo_trainer.py:307: UserWarning: `max_prompt_length` is not set in the DPOTrainer's init it will default to `128` by default, but you should do it yourself in the future. 
warnings.warn( bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Configuring AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147613:1727308088::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147613:1727308088::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! libfabric:2147613:1727308088::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147613:1727308088::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! 
bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 4 device #2 
0000:94:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 libfabric:2147613:1727308088::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:0:31917778 libfabric:2147613:1727308088::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:0:1660188483 libfabric:2147613:1727308088::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:0:814382712 libfabric:2147613:1727308088::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! 
address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:0:2054036245 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Using network AWS Libfabric bolt-fd37yj2rd5-5z3nm97yap:2147614:2147614 [1] NCCL INFO cudaDriverVersion 12020 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147614 [1] NCCL INFO Bootstrap : Using eth0:240.62.43.96<0> bolt-fd37yj2rd5-5z3nm97yap:2147614:2147614 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147614:2147614 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Configuring AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Internode latency set at 75.0 us bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147614:1727308089::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147614:1727308089::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! 
libfabric:2147614:1727308089::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147614:1727308089::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) bolt-fd37yj2rd5-5z3nm97yap:2147619:2147619 [6] NCCL INFO cudaDriverVersion 12020 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147619 [6] NCCL INFO Bootstrap : Using eth0:240.62.43.96<0> bolt-fd37yj2rd5-5z3nm97yap:2147619:2147619 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147619:2147619 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). bolt-fd37yj2rd5-5z3nm97yap:2147618:2147618 [5] NCCL INFO cudaDriverVersion 12020 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147618 [5] NCCL INFO Bootstrap : Using eth0:240.62.43.96<0> bolt-fd37yj2rd5-5z3nm97yap:2147618:2147618 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147618:2147618 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). 
bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 
bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 libfabric:2147614:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:1:128890276 libfabric:2147614:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:1:1342475888 libfabric:2147614:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:1:687522316 libfabric:2147614:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! 
address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:1:1221992966 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147615 [2] NCCL INFO cudaDriverVersion 12020 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147615 [2] NCCL INFO Bootstrap : Using eth0:240.62.43.96<0> bolt-fd37yj2rd5-5z3nm97yap:2147615:2147615 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147615:2147615 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Using network AWS Libfabric bolt-fd37yj2rd5-5z3nm97yap:2147620:2147620 [7] NCCL INFO cudaDriverVersion 12020 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147620 [7] NCCL INFO Bootstrap : Using eth0:240.62.43.96<0> bolt-fd37yj2rd5-5z3nm97yap:2147620:2147620 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147620:2147620 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). bolt-fd37yj2rd5-5z3nm97yap:2147617:2147617 [4] NCCL INFO cudaDriverVersion 12020 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147617 [4] NCCL INFO Bootstrap : Using eth0:240.62.43.96<0> bolt-fd37yj2rd5-5z3nm97yap:2147617:2147617 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147617:2147617 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Configuring AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Internode latency set at 75.0 us bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147619:1727308090::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147619:1727308090::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! bolt-fd37yj2rd5-5z3nm97yap:2147616:2147616 [3] NCCL INFO cudaDriverVersion 12020 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147616 [3] NCCL INFO Bootstrap : Using eth0:240.62.43.96<0> bolt-fd37yj2rd5-5z3nm97yap:2147616:2147616 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. bolt-fd37yj2rd5-5z3nm97yap:2147616:2147616 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5). libfabric:2147619:1727308090::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. 
Trying libnvidia-ml.so.1 libfabric:2147619:1727308090::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Configuring AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Internode latency set at 75.0 us bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147618:1727308090::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147618:1727308090::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! libfabric:2147618:1727308090::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147618:1727308090::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! 
bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Configuring AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Internode latency set at 75.0 us bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147615:1727308090::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147615:1727308090::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 libfabric:2147615:1727308090::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147615:1727308090::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) libfabric:2147619:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:2:992959481 libfabric:2147619:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:2:1685165129 libfabric:2147619:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! 
address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:2:130438553 libfabric:2147619:1727308090::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:2:1599908305 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Using network AWS Libfabric bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 4 
device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Configuring 
AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Internode latency set at 75.0 us bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147617:1727308091::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Configuring AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Internode latency set at 75.0 us 
bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147620:1727308091::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 4 
device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 libfabric:2147617:1727308091::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! libfabric:2147620:1727308091::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! libfabric:2147618:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:3:453696464 libfabric:2147618:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! 
address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:3:2057976579 libfabric:2147618:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:3:1941462929 libfabric:2147618:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:3:1513274188 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Using Libfabric version 1.19 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Using CUDA driver version 12020 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Configuring AWS-specific options bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Internode latency set at 75.0 us bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Using transport protocol RDMA libfabric:2147616:1727308091::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147615:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! 
address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:4:1315283297 libfabric:2147615:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:4:1652614940 libfabric:2147615:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:4:2002058734 libfabric:2147615:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:4:1973750943 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Using network AWS Libfabric libfabric:2147616:1727308091::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! libfabric:2147617:1727308091::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147617:1727308091::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) libfabric:2147620:1727308091::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147620:1727308091::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Using network AWS Libfabric libfabric:2147616:1727308091::core:core:cuda_hmem_dl_init():444 Failed to dlopen libnvidia-ml.so. Trying libnvidia-ml.so.1 libfabric:2147616:1727308091::core:core:cuda_gdrcopy_hmem_init():192 gdrcopy_dl_hmem_init failed! 
bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics) bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 4 device #2 
0000:94:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 1 device #2 
0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 6 device #2 
0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 libfabric:2147617:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:5:1441131698 libfabric:2147617:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:5:1202746764 libfabric:2147617:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:5:682381850 libfabric:2147617:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! 
address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:5:433154730 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0 
bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0 libfabric:2147620:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:6:1110861642 libfabric:2147620:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:6:1011071073 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Using network AWS Libfabric libfabric:2147620:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! 
address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:6:398106000 libfabric:2147620:1727308091::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:6:783454465 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Using network AWS Libfabric libfabric:2147616:1727308092::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::4d:a7ff:fec9:f4b5]:7:751739103 libfabric:2147616:1727308092::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::1:4bff:fe78:82fb]:7:449653708 libfabric:2147616:1727308092::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::8f:cff:fe17:9523]:7:1019013371 libfabric:2147616:1727308092::efa:ep_ctrl:efa_rdm_ep_ctrl():1055 libfabric 1.19.0amzn4.0 efa endpoint created! address: fi_addr_efa://[fe80::81:9dff:fe4d:4179]:7:944544397 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Using network AWS Libfabric bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO comm 0x56117be78020 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 86000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO comm 0x55d5c0cc5a30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId ca000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO comm 0x557388a3f040 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 97000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO comm 0x555dc275fd70 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 53000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO comm 0x55e49833c880 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId b9000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO comm 0x55d73cab80c0 rank 2 nranks 
8 cudaDev 2 nvmlDev 2 busId 75000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO comm 0x55bd90054850 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId a8000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO comm 0x55748651e3c0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 64000 commId 0x39b41913d1be8c14 - Init START bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p5.48xl-topo.xml bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates 
MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric 
provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO 
NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NVLS multicast support is available on dev 4 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Setting affinity for GPU 2 to 0fff,fffffffe,00000000,00000fff,fffffffe bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NVLS multicast support is available on dev 2 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Setting affinity for GPU 0 to 0fff,fffffffe,00000000,00000fff,fffffffe bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NVLS multicast support is available on dev 0 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains 
bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NVLS multicast support is available on dev 5 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NET/OFI Libfabric provider associates MRs with domains bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Setting affinity for GPU 1 to 0fff,fffffffe,00000000,00000fff,fffffffe bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO 
Setting affinity for GPU 3 to 0fff,fffffffe,00000000,00000fff,fffffffe bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NVLS multicast support is available on dev 3 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NVLS multicast support is available on dev 1 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NVLS multicast support is available on dev 7 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NVLS multicast support is available on dev 6 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. 
bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. 
bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 
2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608. 
bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 02/0 : 
3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO 
Channel 07/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO 
Channel 12/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO 
Channel 17/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO 
Channel 23/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO 
Channel 06/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO 
Channel 10/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO 
Channel 12/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO 
Channel 01/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NVLS comm 0x55d5c0cc5a30 headRank 7 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] 
NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NVLS comm 0x56117be78020 headRank 3 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NVLS comm 0x555dc275fd70 headRank 0 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NVLS comm 0x55d73cab80c0 headRank 2 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NVLS comm 0x55748651e3c0 headRank 1 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NVLS comm 0x557388a3f040 headRank 4 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NVLS comm 0x55bd90054850 headRank 5 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NVLS comm 0x55e49833c880 headRank 6 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO 
Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per 
peer bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147616:2147954 [3] NCCL INFO comm 0x56117be78020 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 86000 commId 0x39b41913d1be8c14 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147620:2147945 [7] NCCL INFO comm 0x55d5c0cc5a30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId ca000 commId 0x39b41913d1be8c14 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147615:2147943 [2] NCCL INFO comm 0x55d73cab80c0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 75000 commId 0x39b41913d1be8c14 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147619:2147918 [6] NCCL INFO comm 0x55e49833c880 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId b9000 commId 0x39b41913d1be8c14 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147618:2147937 [5] NCCL INFO comm 0x55bd90054850 rank 5 nranks 
8 cudaDev 5 nvmlDev 5 busId a8000 commId 0x39b41913d1be8c14 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147614:2147905 [1] NCCL INFO comm 0x55748651e3c0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 64000 commId 0x39b41913d1be8c14 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147613:2147896 [0] NCCL INFO comm 0x555dc275fd70 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 53000 commId 0x39b41913d1be8c14 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147617:2147947 [4] NCCL INFO comm 0x557388a3f040 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 97000 commId 0x39b41913d1be8c14 - Init COMPLETE [2024-09-25 23:48:18,463] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2024-09-25 23:48:18,465] [INFO] [logging.py:96:log_dist] [Rank 0] Creating BF16 optimizer [2024-09-25 23:48:18,720] [INFO] [utils.py:802:see_memory_usage] begin bf16_optimizer [2024-09-25 23:48:18,720] [INFO] [utils.py:803:see_memory_usage] MA 16.25 GB Max_MA 16.25 GB CA 16.27 GB Max_CA 16 GB [2024-09-25 23:48:18,721] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 212.78 GB, percent = 10.6% [2024-09-25 23:48:18,963] [INFO] [utils.py:802:see_memory_usage] end bf16_optimizer [2024-09-25 23:48:18,964] [INFO] [utils.py:803:see_memory_usage] MA 16.25 GB Max_MA 16.25 GB CA 16.27 GB Max_CA 16 GB [2024-09-25 23:48:18,964] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 211.84 GB, percent = 10.6% [2024-09-25 23:48:18,966] [INFO] [config.py:972:print] DeepSpeedEngine configuration: [2024-09-25 23:48:18,966] [INFO] [config.py:976:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-09-25 23:48:18,966] [INFO] [config.py:976:print] aio_config ................... 
{'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-09-25 23:48:18,966] [INFO] [config.py:976:print] amp_enabled .................. False [2024-09-25 23:48:18,966] [INFO] [config.py:976:print] amp_params ................... False [2024-09-25 23:48:18,966] [INFO] [config.py:976:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] bfloat16_enabled ............. True [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] comms_config ................. [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] communication_data_type ...... None [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] curriculum_params_legacy ..... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] data_efficiency_enabled ...... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] dataloader_drop_last ......... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] disable_allgather ............ False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] dump_state ................... 
False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_enabled ........... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] eigenvalue_verbose ........... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] elasticity_enabled ........... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] fp16_auto_cast ............... None [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] fp16_enabled ................. False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] global_rank .................. 0 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] grad_accum_dtype ............. None [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] gradient_accumulation_steps .. 4 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] gradient_clipping ............ 0.0 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] hybrid_engine ................ 
enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] load_universal_checkpoint .... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] loss_scale ................... 1.0 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] memory_breakdown ............. False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] mics_hierarchial_params_gather False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] mics_shard_size .............. -1 [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] optimizer_name ............... None [2024-09-25 23:48:18,967] [INFO] [config.py:976:print] optimizer_params ............. None [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] pld_enabled .................. False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] pld_params ................... 
False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] prescale_gradients ........... False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] scheduler_name ............... None [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] scheduler_params ............. None [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32 [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] sparse_attention ............. None [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] steps_per_print .............. inf [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] train_batch_size ............. 32 [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 1 [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] use_node_local_storage ....... False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] wall_clock_breakdown ......... False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] weight_quantization_config ... None [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] world_size ................... 8 [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] zero_allow_untested_optimizer False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] zero_config .................. 
stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] zero_enabled ................. False [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True [2024-09-25 23:48:18,968] [INFO] [config.py:976:print] zero_optimization_stage ...... 0 [2024-09-25 23:48:18,968] [INFO] [config.py:962:print_user_config] json = { "fp16": { "enabled": false, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "train_micro_batch_size_per_gpu": 1, "train_batch_size": 32, "gradient_accumulation_steps": 4, "zero_optimization": { "stage": 0, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1.000000e+09, "reduce_bucket_size": "auto" }, "steps_per_print": inf } wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. wandb: Currently logged in as: zrh331. 
Use `wandb login --relogin` to force relogin wandb: Tracking run with wandb version 0.18.1 wandb: Run data is saved locally in /mnt/task_runtime/lmm/dpo/wandb/run-20240925_234837-metpry44 wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run 0925_dpo-0921_sft_A_math-text+math2_datax2_it1_90 wandb: ⭐️ View project at https://wandb.ai/zrh331/llava-llama3-dpo3 wandb: 🚀 View run at https://wandb.ai/zrh331/llava-llama3-dpo3/runs/metpry44 0%| | 0/1336 [00:003->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 
7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 
bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 15/24 : 0 
1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO P2P Chunksize set to 524288 bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 00/0 : 
7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO 
Channel 06/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO 
Channel 11/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO 
Channel 17/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO 
Channel 22/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Connected all rings bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO 
Channel 01/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO 
Channel 08/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO 
Channel 10/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/IPC 
bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO 
Channel 18/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/IPC bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO NVLS comm 0x555e0160eaf0 headRank 0 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO NVLS comm 0x55d5f79288e0 headRank 7 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO NVLS comm 0x55d76be89180 headRank 2 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Connected all trees 
bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO NVLS comm 0x5574b7b26020 headRank 1 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO NVLS comm 0x561193b5a920 headRank 3 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO NVLS comm 0x55bdb1679e80 headRank 5 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO NVLS comm 0x557398f9e870 headRank 4 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Connected all trees bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO NVLS comm 0x55e4b1317e10 headRank 6 nHeads 8 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 2684354560 bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO NCCL_PROTO set by environment to SIMPLE 
bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO Connected NVLS tree bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO NCCL_PROTO set by environment to SIMPLE 
bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer bolt-fd37yj2rd5-5z3nm97yap:2147614:2153051 [1] NCCL INFO comm 0x5574b7b26020 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 64000 commId 0xb0975888ea877831 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147616:2153034 [3] NCCL INFO comm 0x561193b5a920 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 86000 commId 0xb0975888ea877831 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147620:2153054 [7] NCCL INFO comm 0x55d5f79288e0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId ca000 commId 0xb0975888ea877831 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147618:2153053 [5] NCCL INFO comm 0x55bdb1679e80 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId a8000 commId 0xb0975888ea877831 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147613:2150664 [0] NCCL INFO comm 0x555e0160eaf0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 53000 commId 0xb0975888ea877831 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147617:2153052 [4] NCCL INFO comm 0x557398f9e870 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 97000 commId 0xb0975888ea877831 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147619:2153056 [6] NCCL INFO comm 0x55e4b1317e10 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId b9000 commId 0xb0975888ea877831 - Init COMPLETE bolt-fd37yj2rd5-5z3nm97yap:2147615:2153055 [2] NCCL INFO comm 0x55d76be89180 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 75000 commId 0xb0975888ea877831 - Init COMPLETE Could not estimate the number of tokens of the input, floating-point operations will not 
be computed /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
0%| | 1/1336 [00:15<5:42:38, 15.40s/it] {'loss': 0.6931, 'grad_norm': 59.974050584778176, 'learning_rate': 1.2195121951219512e-08, 'losses/dpo': 0.6931471824645996, 'losses/sft': 1.0480493307113647, 'losses/total': 0.6931471824645996, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -105.89979553222656, 'logps/chosen': -96.5219955444336, 'ref_logps/rejected': -105.89979553222656, 'ref_logps/chosen': -96.5219955444336, 'epoch': 0.0}
0%| | 2/1336 [00:20<3:33:47, 9.62s/it] {'loss': 0.6931, 'grad_norm': 50.97061327832686, 'learning_rate': 2.4390243902439023e-08, 'losses/dpo': 0.6931471824645996, 'losses/sft': 0.8721723556518555, 'losses/total': 0.6931471824645996, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -97.79839324951172, 'logps/chosen': -88.67269897460938, 'ref_logps/rejected': -97.79839324951172, 'ref_logps/chosen': -88.67269897460938, 'epoch': 0.0}
0%| | 3/1336 [00:26<2:56:59, 7.97s/it] {'loss': 0.6877, 'grad_norm': 51.34404612924222, 'learning_rate': 3.658536585365853e-08, 'losses/dpo': 0.6847645044326782, 'losses/sft': 0.7619496583938599, 'losses/total': 0.6847645044326782, 'rewards/chosen': 0.0075337765738368034, 'rewards/rejected': -0.003923260606825352, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.01145703811198473, 'logps/rejected': -91.552490234375, 'logps/chosen': -82.60709381103516, 'ref_logps/rejected': -91.51325988769531, 'ref_logps/chosen': -82.68243408203125, 'epoch': 0.0}
0%| | 4/1336 [00:32<2:37:43, 7.10s/it] {'loss': 0.6964, 'grad_norm': 55.65705030530943, 'learning_rate': 4.878048780487805e-08, 'losses/dpo': 0.7135478854179382, 'losses/sft': 1.1664021015167236, 'losses/total': 0.7135478854179382, 'rewards/chosen': -0.011832120828330517, 'rewards/rejected': -0.00585958082228899, 'rewards/accuracies': 0.46875, 'rewards/margins': -0.005972540006041527, 'logps/rejected': -110.39165496826172, 'logps/chosen': -108.95577239990234, 'ref_logps/rejected': -110.33306121826172, 'ref_logps/chosen': -108.83744812011719, 'epoch': 0.0}
0%| | 5/1336 [00:38<2:26:28, 6.60s/it] {'loss': 0.6987, 'grad_norm': 55.90046801021424, 'learning_rate': 6.097560975609756e-08, 'losses/dpo': 0.6758191585540771, 'losses/sft': 0.6152290105819702, 'losses/total': 0.6758191585540771, 'rewards/chosen': -0.0022934856824576855, 'rewards/rejected': 0.007932561449706554, 'rewards/accuracies': 0.46875, 'rewards/margins': -0.010226046666502953, 'logps/rejected': -100.73726654052734, 'logps/chosen': -96.78623962402344, 'ref_logps/rejected': -100.81658935546875, 'ref_logps/chosen': -96.76331329345703, 'epoch': 0.0}
0%| | 6/1336 [00:44<2:20:06, 6.32s/it] {'loss': 0.6956, 'grad_norm': 66.30510570492672, 'learning_rate': 7.317073170731706e-08, 'losses/dpo': 0.7265294194221497, 'losses/sft': 1.0552270412445068, 'losses/total': 0.7265294194221497, 'rewards/chosen': -0.0010399851016700268, 'rewards/rejected': 0.0032081662211567163, 'rewards/accuracies': 0.375, 'rewards/margins': -0.004248150624334812, 'logps/rejected': -87.60787963867188, 'logps/chosen': -83.2293701171875, 'ref_logps/rejected': -87.63996887207031, 'ref_logps/chosen': -83.2189712524414, 'epoch': 0.0}
1%| | 7/1336 [00:49<2:14:18, 6.06s/it] {'loss': 0.6904, 'grad_norm': 61.026766892186195, 'learning_rate': 8.536585365853659e-08, 'losses/dpo': 0.6828237175941467, 'losses/sft': 0.5018908381462097, 'losses/total': 0.6828237175941467, 'rewards/chosen': -0.0016372414538636804, 'rewards/rejected': -0.0076447282917797565, 'rewards/accuracies': 0.5, 'rewards/margins': 0.006007486954331398, 'logps/rejected': -84.92127990722656, 'logps/chosen': -76.70850372314453, 'ref_logps/rejected': -84.84483337402344, 'ref_logps/chosen': -76.69213104248047, 'epoch': 0.01}
1%| | 8/1336 [00:55<2:11:57, 5.96s/it] {'loss': 0.6953, 'grad_norm': 53.568933125094894, 'learning_rate': 9.75609756097561e-08, 'losses/dpo': 0.6982625722885132, 'losses/sft': 0.835720419883728, 'losses/total': 0.6982625722885132, 'rewards/chosen': -0.0032170102931559086, 'rewards/rejected': 0.0006555425934493542, 'rewards/accuracies': 0.53125, 'rewards/margins': -0.003872552188113332, 'logps/rejected': -91.24917602539062, 'logps/chosen': -88.31718444824219, 'ref_logps/rejected': -91.25572967529297, 'ref_logps/chosen': -88.28501892089844, 'epoch': 0.01}
1%| | 9/1336 [01:01<2:13:39, 6.04s/it] {'loss': 0.688, 'grad_norm': 63.7463823975001, 'learning_rate': 1.097560975609756e-07, 'losses/dpo': 0.6819353103637695, 'losses/sft': 0.8514418005943298, 'losses/total': 0.6819353103637695, 'rewards/chosen': 0.0022748024202883244, 'rewards/rejected': -0.008499324321746826, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.010774126276373863, 'logps/rejected': -93.98930358886719, 'logps/chosen': -84.27828979492188, 'ref_logps/rejected': -93.904296875, 'ref_logps/chosen': -84.30104064941406, 'epoch': 0.01}
1%| | 10/1336 [01:07<2:13:30, 6.04s/it] {'loss': 0.6903, 'grad_norm': 70.09916529355763, 'learning_rate': 1.219512195121951e-07, 'losses/dpo': 0.6834984421730042, 'losses/sft': 1.0034970045089722, 'losses/total': 0.6834984421730042, 'rewards/chosen': 0.004701527766883373, 'rewards/rejected': -0.0015038668643683195, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.006205395795404911, 'logps/rejected': -112.92938232421875, 'logps/chosen': -104.21539306640625, 'ref_logps/rejected': -112.91433715820312, 'ref_logps/chosen': -104.26239776611328, 'epoch': 0.01}
1%| | 11/1336 [01:13<2:13:03, 6.03s/it] {'loss': 0.6885, 'grad_norm': 48.982005279508336, 'learning_rate': 1.3414634146341465e-07, 'losses/dpo': 0.6822197437286377, 'losses/sft': 0.6507266163825989, 'losses/total': 0.6822197437286377, 'rewards/chosen': 0.006812172941863537, 'rewards/rejected': -0.00304009928368032, 'rewards/accuracies': 0.625, 'rewards/margins': 0.009852271527051926, 'logps/rejected': -98.31290435791016, 'logps/chosen': -86.68479919433594, 'ref_logps/rejected': -98.28250122070312, 'ref_logps/chosen': -86.75291442871094, 'epoch': 0.01}
1%| | 12/1336 [01:19<2:09:20, 5.86s/it] {'loss': 0.696, 'grad_norm': 44.878427379921064, 'learning_rate': 1.4634146341463413e-07, 'losses/dpo': 0.6876137256622314, 'losses/sft': 0.578190267086029, 'losses/total': 0.6876137256622314, 'rewards/chosen': -0.0015541142784059048, 'rewards/rejected': 0.0037428471259772778, 'rewards/accuracies': 0.375, 'rewards/margins': -0.005296960938721895, 'logps/rejected': -78.97032165527344, 'logps/chosen': -76.83417510986328, 'ref_logps/rejected': -79.00775146484375, 'ref_logps/chosen': -76.81863403320312, 'epoch': 0.01}
1%| | 13/1336 [01:25<2:11:34, 5.97s/it] {'loss': 0.6923, 'grad_norm': 53.51957731256799, 'learning_rate': 1.5853658536585366e-07, 'losses/dpo': 0.691738486289978, 'losses/sft': 0.9301700592041016, 'losses/total': 0.691738486289978, 'rewards/chosen': -0.0005280462792143226, 'rewards/rejected': -0.002770650666207075, 'rewards/accuracies': 0.46875, 'rewards/margins': 0.0022426038049161434, 'logps/rejected': -114.72909545898438, 'logps/chosen': -106.34982299804688, 'ref_logps/rejected': -114.7013931274414, 'ref_logps/chosen': -106.34454345703125, 'epoch': 0.01}
1%| | 14/1336 [01:30<2:08:32, 5.83s/it] {'loss': 0.6946, 'grad_norm': 49.2781756015982, 'learning_rate': 1.7073170731707317e-07, 'losses/dpo': 0.6899164319038391, 'losses/sft': 0.48029160499572754, 'losses/total': 0.6899164319038391, 'rewards/chosen': -0.004526582546532154, 'rewards/rejected': -0.0019756671972572803, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.002550914417952299, 'logps/rejected': -96.45118713378906, 'logps/chosen': -87.08540344238281, 'ref_logps/rejected': -96.43143463134766, 'ref_logps/chosen': -87.04013061523438, 'epoch': 0.01}
1%| | 15/1336 [01:36<2:06:40, 5.75s/it] {'loss': 0.695, 'grad_norm': 83.2821807294014, 'learning_rate': 1.8292682926829268e-07, 'losses/dpo': 0.6700431108474731, 'losses/sft': 0.477091908454895, 'losses/total': 0.6700431108474731, 'rewards/chosen': -0.0007672780193388462, 'rewards/rejected': 0.0021877880208194256, 'rewards/accuracies': 0.40625, 'rewards/margins': -0.0029550669714808464, 'logps/rejected': -92.03987121582031, 'logps/chosen': -84.44003295898438, 'ref_logps/rejected': -92.06175231933594, 'ref_logps/chosen': -84.43235778808594, 'epoch': 0.01}
1%| | 16/1336 [01:41<2:04:04, 5.64s/it] {'loss': 0.6878, 'grad_norm': 57.54071830651204, 'learning_rate': 1.951219512195122e-07, 'losses/dpo': 0.6848957538604736, 'losses/sft': 0.42037433385849, 'losses/total': 0.6848957538604736, 'rewards/chosen': 0.005588349886238575, 'rewards/rejected': -0.005654109176248312, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.011242459528148174, 'logps/rejected': -97.41439819335938, 'logps/chosen': -90.35257720947266, 'ref_logps/rejected': -97.35786437988281, 'ref_logps/chosen': -90.40846252441406, 'epoch': 0.01}
1%|▏ | 17/1336 [01:47<2:04:14, 5.65s/it] {'loss': 0.68, 'grad_norm': 51.507084035213516, 'learning_rate': 2.073170731707317e-07, 'losses/dpo': 0.6908072233200073, 'losses/sft': 0.7803502082824707, 'losses/total': 0.6908072233200073, 'rewards/chosen': 0.009256690740585327, 'rewards/rejected': -0.01769128441810608, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.026947977021336555, 'logps/rejected': -100.69586944580078, 'logps/chosen': -91.31112670898438, 'ref_logps/rejected': -100.51895141601562, 'ref_logps/chosen': -91.40369415283203, 'epoch': 0.01}
1%|▏ | 18/1336 [01:53<2:07:16, 5.79s/it] {'loss': 0.697, 'grad_norm': 78.4783212285233, 'learning_rate': 2.195121951219512e-07, 'losses/dpo': 0.6907550096511841, 'losses/sft': 0.6824198365211487, 'losses/total': 0.6907550096511841, 'rewards/chosen': -0.004249152727425098, 'rewards/rejected': 0.0030124697368592024, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.00726162176579237, 'logps/rejected': -95.31147766113281, 'logps/chosen': -81.65106201171875, 'ref_logps/rejected': -95.34159851074219, 'ref_logps/chosen': -81.60857391357422, 'epoch': 0.01}
1%|▏ | 19/1336 [01:59<2:05:57, 5.74s/it] {'loss': 0.6924, 'grad_norm': 60.714003443662854, 'learning_rate': 2.3170731707317074e-07, 'losses/dpo': 0.6738797426223755, 'losses/sft': 1.2954185009002686, 'losses/total': 0.6738797426223755, 'rewards/chosen': 0.009589070454239845, 'rewards/rejected': 0.007222080137580633, 'rewards/accuracies': 0.46875, 'rewards/margins': 0.0023669907823204994, 'logps/rejected': -106.2554931640625, 'logps/chosen': -97.74163055419922, 'ref_logps/rejected': -106.32772827148438, 'ref_logps/chosen': -97.8375244140625, 'epoch': 0.01}
1%|▏ | 20/1336 [02:04<2:04:57, 5.70s/it] {'loss': 0.6907, 'grad_norm': 62.96834410733016, 'learning_rate': 2.439024390243902e-07, 'losses/dpo': 0.7008021473884583, 'losses/sft': 0.2065332680940628, 'losses/total': 0.7008021473884583, 'rewards/chosen': 0.0015764713753014803, 'rewards/rejected': -0.003683194750919938, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.005259665660560131, 'logps/rejected': -72.29498291015625, 'logps/chosen': -61.521541595458984, 'ref_logps/rejected': -72.2581558227539, 'ref_logps/chosen': -61.53730392456055, 'epoch': 0.01}
2%|▏ | 21/1336 [02:10<2:04:37, 5.69s/it] {'loss': 0.6956, 'grad_norm': 50.04066687460498, 'learning_rate': 2.5609756097560976e-07, 'losses/dpo': 0.6744053363800049, 'losses/sft': 0.7949776649475098, 'losses/total': 0.6744053363800049, 'rewards/chosen': -0.000866475747898221, 'rewards/rejected': 0.0032598613761365414, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.004126338288187981, 'logps/rejected': -94.33818054199219, 'logps/chosen': -83.53723907470703, 'ref_logps/rejected': -94.37078094482422, 'ref_logps/chosen': -83.52857208251953, 'epoch': 0.02}
2%|▏ | 22/1336 [02:16<2:07:18, 5.81s/it] {'loss': 0.6864, 'grad_norm': 62.90105448992747, 'learning_rate': 2.682926829268293e-07, 'losses/dpo': 0.6711546182632446, 'losses/sft': 1.231006145477295, 'losses/total': 0.6711546182632446, 'rewards/chosen': 0.0018609343096613884, 'rewards/rejected': -0.012084433808922768, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.01394536904990673, 'logps/rejected': -82.03578186035156, 'logps/chosen': -75.12887573242188, 'ref_logps/rejected': -81.91493225097656, 'ref_logps/chosen': -75.14749145507812, 'epoch': 0.02}
2%|▏ | 23/1336 [02:22<2:06:22, 5.78s/it] {'loss': 0.6883, 'grad_norm': 62.143024738756004, 'learning_rate': 2.8048780487804877e-07, 'losses/dpo': 0.713842511177063, 'losses/sft': 0.579557478427887, 'losses/total': 0.713842511177063, 'rewards/chosen': 0.014829244464635849, 'rewards/rejected': 0.0042709289118647575, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.010558316484093666, 'logps/rejected': -114.9342269897461, 'logps/chosen': -106.08462524414062, 'ref_logps/rejected': -114.97693634033203, 'ref_logps/chosen': -106.23292541503906, 'epoch': 0.02}
2%|▏ | 24/1336 [02:27<2:04:31, 5.69s/it] {'loss': 0.6937, 'grad_norm': 45.04089833444425, 'learning_rate': 2.9268292682926825e-07, 'losses/dpo': 0.6925753951072693, 'losses/sft': 0.3869549632072449, 'losses/total': 0.6925753951072693, 'rewards/chosen': -0.008264392614364624, 'rewards/rejected': -0.007925136014819145, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.0003392565995454788, 'logps/rejected': -75.62657165527344, 'logps/chosen': -73.27450561523438, 'ref_logps/rejected': -75.54731750488281, 'ref_logps/chosen': -73.19186401367188, 'epoch': 0.02}
2%|▏ | 25/1336 [02:33<2:03:15, 5.64s/it] {'loss': 0.6938, 'grad_norm': 108.66984631463723, 'learning_rate': 3.048780487804878e-07, 'losses/dpo': 0.6884900331497192, 'losses/sft': 0.6735142469406128, 'losses/total': 0.6884900331497192, 'rewards/chosen': -0.006176230497658253, 'rewards/rejected': -0.005271682515740395, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.0009045484475791454, 'logps/rejected': -98.73696899414062, 'logps/chosen': -98.27293395996094, 'ref_logps/rejected': -98.68425750732422, 'ref_logps/chosen': -98.21116638183594, 'epoch': 0.02}
2%|▏ | 26/1336 [02:39<2:02:47, 5.62s/it] {'loss': 0.6934, 'grad_norm': 50.8873017268112, 'learning_rate': 3.170731707317073e-07, 'losses/dpo': 0.6929163336753845, 'losses/sft': 0.6067396402359009, 'losses/total': 0.6929163336753845, 'rewards/chosen': 0.0032984840217977762, 'rewards/rejected': 0.003234690520912409, 'rewards/accuracies': 0.53125, 'rewards/margins': 6.379315163940191e-05, 'logps/rejected': -89.82010650634766, 'logps/chosen': -91.29349517822266, 'ref_logps/rejected': -89.85245513916016, 'ref_logps/chosen': -91.32647705078125, 'epoch': 0.02}
2%|▏ | 27/1336 [02:44<2:04:27, 5.70s/it] {'loss': 0.6978, 'grad_norm': 48.65607536152276, 'learning_rate': 3.292682926829268e-07, 'losses/dpo': 0.7131488919258118, 'losses/sft': 0.7464495897293091, 'losses/total': 0.7131488919258118, 'rewards/chosen': -0.007220965810120106, 'rewards/rejected': 0.0013507064431905746, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.008571673184633255, 'logps/rejected': -81.38693237304688, 'logps/chosen': -82.84942626953125, 'ref_logps/rejected': -81.40044403076172, 'ref_logps/chosen': -82.77721405029297, 'epoch': 0.02}
2%|▏ | 28/1336 [02:50<2:05:55, 5.78s/it] {'loss': 0.6973, 'grad_norm': 57.138903138418506, 'learning_rate': 3.4146341463414634e-07, 'losses/dpo': 0.713499128818512, 'losses/sft': 1.2389531135559082, 'losses/total': 0.713499128818512, 'rewards/chosen': -0.0019167419523000717, 'rewards/rejected': 0.0057752421125769615, 'rewards/accuracies': 0.5, 'rewards/margins': -0.007691984996199608, 'logps/rejected': -97.15518188476562, 'logps/chosen': -92.43415069580078, 'ref_logps/rejected': -97.21293640136719, 'ref_logps/chosen': -92.41498565673828, 'epoch': 0.02}
2%|▏ | 29/1336 [02:56<2:05:53, 5.78s/it] {'loss': 0.6951, 'grad_norm': 72.27582318243923, 'learning_rate': 3.536585365853658e-07, 'losses/dpo': 0.7139302492141724, 'losses/sft': 0.845596432685852, 'losses/total': 0.7139302492141724, 'rewards/chosen': -0.013885889202356339, 'rewards/rejected': -0.010572931729257107, 'rewards/accuracies': 0.46875, 'rewards/margins': -0.003312957938760519, 'logps/rejected': -100.27867889404297, 'logps/chosen': -97.90177917480469, 'ref_logps/rejected': -100.17295837402344, 'ref_logps/chosen': -97.76292419433594, 'epoch': 0.02}
2%|▏ | 30/1336 [03:02<2:04:32, 5.72s/it] {'loss': 0.683, 'grad_norm': 51.96774863576113, 'learning_rate': 3.6585365853658536e-07, 'losses/dpo': 0.6717315912246704, 'losses/sft': 0.5275499224662781, 'losses/total': 0.6717315912246704, 'rewards/chosen': 0.004077172838151455, 'rewards/rejected': -0.01704074814915657, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.02111792005598545, 'logps/rejected': -85.0166015625, 'logps/chosen': -75.03881072998047, 'ref_logps/rejected': -84.84619140625, 'ref_logps/chosen': -75.07958984375, 'epoch': 0.02}
2%|▏ | 31/1336 [03:08<2:05:23, 5.77s/it] {'loss': 0.6945, 'grad_norm': 62.371976637718035, 'learning_rate': 3.7804878048780484e-07, 'losses/dpo': 0.7032510638237, 'losses/sft': 0.22414442896842957, 'losses/total': 0.7032510638237, 'rewards/chosen': -0.007419183850288391, 'rewards/rejected': -0.005440652370452881, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.0019785314798355103, 'logps/rejected': -104.1169662475586, 'logps/chosen': -98.24893188476562, 'ref_logps/rejected': -104.06255340576172, 'ref_logps/chosen': -98.17474365234375, 'epoch': 0.02}
2%|▏ | 32/1336 [03:14<2:06:54, 5.84s/it] {'loss': 0.6894, 'grad_norm': 49.73091575104649, 'learning_rate': 3.902439024390244e-07, 'losses/dpo': 0.6745765805244446, 'losses/sft': 0.8292353749275208, 'losses/total': 0.6745765805244446, 'rewards/chosen': 0.005499035120010376, 'rewards/rejected': -0.0025447607040405273, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.008043794892728329, 'logps/rejected': -81.69898986816406, 'logps/chosen': -81.25952911376953, 'ref_logps/rejected': -81.67354583740234, 'ref_logps/chosen': -81.31452178955078, 'epoch': 0.02}
2%|▏ | 33/1336 [03:19<2:05:33, 5.78s/it] {'loss': 0.6944, 'grad_norm': 53.865023828860785, 'learning_rate': 4.024390243902439e-07, 'losses/dpo': 0.6830124258995056, 'losses/sft': 0.6868780255317688, 'losses/total': 0.6830124258995056, 'rewards/chosen': -0.005026575643569231, 'rewards/rejected': -0.003217357210814953, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0018092188984155655, 'logps/rejected': -92.54141235351562, 'logps/chosen': -90.13691711425781, 'ref_logps/rejected': -92.50923156738281, 'ref_logps/chosen': -90.0866470336914, 'epoch': 0.02}
3%|▎ | 34/1336 [03:25<2:04:17, 5.73s/it] {'loss': 0.6962, 'grad_norm': 49.069453044398365, 'learning_rate': 4.146341463414634e-07, 'losses/dpo': 0.6928682327270508, 'losses/sft':
0.5342616438865662, 'losses/total': 0.6928682327270508, 'rewards/chosen': -0.0061765192076563835, 'rewards/rejected': -0.0007020427146926522, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.00547447707504034, 'logps/rejected': -81.5191650390625, 'logps/chosen': -70.74198150634766, 'ref_logps/rejected': -81.51214599609375, 'ref_logps/chosen': -70.68021392822266, 'epoch': 0.03} 3%|▎ | 34/1336 [03:25<2:04:17, 5.73s/it] 3%|▎ | 35/1336 [03:31<2:05:06, 5.77s/it] {'loss': 0.6879, 'grad_norm': 50.594457233803816, 'learning_rate': 4.268292682926829e-07, 'losses/dpo': 0.6909371614456177, 'losses/sft': 0.15952838957309723, 'losses/total': 0.6909371614456177, 'rewards/chosen': 0.006030449643731117, 'rewards/rejected': -0.004946747329086065, 'rewards/accuracies': 0.625, 'rewards/margins': 0.01097719743847847, 'logps/rejected': -58.93048095703125, 'logps/chosen': -52.091697692871094, 'ref_logps/rejected': -58.881011962890625, 'ref_logps/chosen': -52.152000427246094, 'epoch': 0.03} 3%|▎ | 35/1336 [03:31<2:05:06, 5.77s/it] 3%|▎ | 36/1336 [03:37<2:06:18, 5.83s/it] {'loss': 0.6942, 'grad_norm': 62.35636981403165, 'learning_rate': 4.390243902439024e-07, 'losses/dpo': 0.6876667737960815, 'losses/sft': 1.0390082597732544, 'losses/total': 0.6876667737960815, 'rewards/chosen': -0.0069717480801045895, 'rewards/rejected': -0.005872136913239956, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0010996116325259209, 'logps/rejected': -99.71139526367188, 'logps/chosen': -97.5057144165039, 'ref_logps/rejected': -99.65267181396484, 'ref_logps/chosen': -97.43600463867188, 'epoch': 0.03} 3%|▎ | 36/1336 [03:37<2:06:18, 5.83s/it] 3%|▎ | 37/1336 [03:43<2:08:07, 5.92s/it] {'loss': 0.6899, 'grad_norm': 69.89842236911912, 'learning_rate': 4.5121951219512194e-07, 'losses/dpo': 0.7072774171829224, 'losses/sft': 0.6150345802307129, 'losses/total': 0.7072774171829224, 'rewards/chosen': -0.003409826662391424, 'rewards/rejected': -0.010454708710312843, 'rewards/accuracies': 0.53125, 'rewards/margins': 
0.007044881582260132, 'logps/rejected': -97.86152648925781, 'logps/chosen': -85.88082122802734, 'ref_logps/rejected': -97.75698852539062, 'ref_logps/chosen': -85.84672546386719, 'epoch': 0.03} 3%|▎ | 37/1336 [03:43<2:08:07, 5.92s/it] 3%|▎ | 38/1336 [03:49<2:07:46, 5.91s/it] {'loss': 0.6935, 'grad_norm': 68.09297575385716, 'learning_rate': 4.634146341463415e-07, 'losses/dpo': 0.6917954683303833, 'losses/sft': 1.6796200275421143, 'losses/total': 0.6917954683303833, 'rewards/chosen': -0.009004959836602211, 'rewards/rejected': -0.00894465483725071, 'rewards/accuracies': 0.53125, 'rewards/margins': -6.030488293617964e-05, 'logps/rejected': -97.7885513305664, 'logps/chosen': -102.5638427734375, 'ref_logps/rejected': -97.69910430908203, 'ref_logps/chosen': -102.47378540039062, 'epoch': 0.03} 3%|▎ | 38/1336 [03:49<2:07:46, 5.91s/it] 3%|▎ | 39/1336 [03:54<2:05:57, 5.83s/it] {'loss': 0.6957, 'grad_norm': 74.09542519144998, 'learning_rate': 4.756097560975609e-07, 'losses/dpo': 0.7104014754295349, 'losses/sft': 0.7669236660003662, 'losses/total': 0.7104014754295349, 'rewards/chosen': -0.008305096998810768, 'rewards/rejected': -0.003842422040179372, 'rewards/accuracies': 0.46875, 'rewards/margins': -0.004462673794478178, 'logps/rejected': -96.40318298339844, 'logps/chosen': -88.935546875, 'ref_logps/rejected': -96.3647689819336, 'ref_logps/chosen': -88.85249328613281, 'epoch': 0.03} 3%|▎ | 39/1336 [03:54<2:05:57, 5.83s/it] 3%|▎ | 40/1336 [04:00<2:06:11, 5.84s/it] {'loss': 0.6824, 'grad_norm': 90.67944839668186, 'learning_rate': 4.878048780487804e-07, 'losses/dpo': 0.6790875196456909, 'losses/sft': 0.8468064069747925, 'losses/total': 0.6790875196456909, 'rewards/chosen': 0.004250587895512581, 'rewards/rejected': -0.017715759575366974, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.021966345608234406, 'logps/rejected': -95.01719665527344, 'logps/chosen': -84.25971984863281, 'ref_logps/rejected': -94.84003448486328, 'ref_logps/chosen': -84.3022232055664, 'epoch': 0.03} 3%|▎ | 
40/1336 [04:00<2:06:11, 5.84s/it] 3%|▎ | 41/1336 [04:06<2:07:22, 5.90s/it] {'loss': 0.6964, 'grad_norm': 57.11476896725347, 'learning_rate': 5e-07, 'losses/dpo': 0.7078738808631897, 'losses/sft': 0.5666255950927734, 'losses/total': 0.7078738808631897, 'rewards/chosen': -0.012445923872292042, 'rewards/rejected': -0.006647817324846983, 'rewards/accuracies': 0.53125, 'rewards/margins': -0.0057981060817837715, 'logps/rejected': -96.327392578125, 'logps/chosen': -97.73909759521484, 'ref_logps/rejected': -96.26091003417969, 'ref_logps/chosen': -97.61463165283203, 'epoch': 0.03} 3%|▎ | 41/1336 [04:06<2:07:22, 5.90s/it] 3%|▎ | 42/1336 [04:12<2:06:26, 5.86s/it] {'loss': 0.6949, 'grad_norm': 58.986868932686484, 'learning_rate': 4.999992643520848e-07, 'losses/dpo': 0.6837512850761414, 'losses/sft': 1.1926568746566772, 'losses/total': 0.6837512850761414, 'rewards/chosen': -0.01438295841217041, 'rewards/rejected': -0.011350364424288273, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.0030325944535434246, 'logps/rejected': -103.89016723632812, 'logps/chosen': -92.3856201171875, 'ref_logps/rejected': -103.77666473388672, 'ref_logps/chosen': -92.24179077148438, 'epoch': 0.03} 3%|▎ | 42/1336 [04:12<2:06:26, 5.86s/it] 3%|▎ | 43/1336 [04:18<2:07:35, 5.92s/it] {'loss': 0.6899, 'grad_norm': 68.19228307799769, 'learning_rate': 4.999970574126684e-07, 'losses/dpo': 0.6962037682533264, 'losses/sft': 0.6758909225463867, 'losses/total': 0.6962037682533264, 'rewards/chosen': -0.006382448133081198, 'rewards/rejected': -0.013423713855445385, 'rewards/accuracies': 0.625, 'rewards/margins': 0.0070412661880254745, 'logps/rejected': -82.7593994140625, 'logps/chosen': -83.64216613769531, 'ref_logps/rejected': -82.62516784667969, 'ref_logps/chosen': -83.5783462524414, 'epoch': 0.03} 3%|▎ | 43/1336 [04:18<2:07:35, 5.92s/it] 3%|▎ | 44/1336 [04:24<2:05:55, 5.85s/it] {'loss': 0.6907, 'grad_norm': 64.28075754795778, 'learning_rate': 4.999933791947391e-07, 'losses/dpo': 0.698017954826355, 'losses/sft': 
0.34049952030181885, 'losses/total': 0.698017954826355, 'rewards/chosen': -0.00878117699176073, 'rewards/rejected': -0.014175841584801674, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.0053946636617183685, 'logps/rejected': -101.4595947265625, 'logps/chosen': -92.13870239257812, 'ref_logps/rejected': -101.31784057617188, 'ref_logps/chosen': -92.05088806152344, 'epoch': 0.03} 3%|▎ | 44/1336 [04:24<2:05:55, 5.85s/it] 3%|▎ | 45/1336 [04:30<2:06:45, 5.89s/it] {'loss': 0.7008, 'grad_norm': 84.27569164579617, 'learning_rate': 4.999882297199441e-07, 'losses/dpo': 0.683803915977478, 'losses/sft': 1.0301213264465332, 'losses/total': 0.683803915977478, 'rewards/chosen': -0.0227656289935112, 'rewards/rejected': -0.008241619914770126, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.014524010010063648, 'logps/rejected': -103.25550842285156, 'logps/chosen': -102.0521240234375, 'ref_logps/rejected': -103.17308807373047, 'ref_logps/chosen': -101.824462890625, 'epoch': 0.03} 3%|▎ | 45/1336 [04:30<2:06:45, 5.89s/it] 3%|▎ | 46/1336 [04:35<2:05:24, 5.83s/it] {'loss': 0.6919, 'grad_norm': 103.16928677372175, 'learning_rate': 4.999816090185887e-07, 'losses/dpo': 0.7002678513526917, 'losses/sft': 1.0810365676879883, 'losses/total': 0.7002678513526917, 'rewards/chosen': -0.02102803625166416, 'rewards/rejected': -0.02450498379766941, 'rewards/accuracies': 0.5, 'rewards/margins': 0.0034769480116665363, 'logps/rejected': -106.74534606933594, 'logps/chosen': -92.38207244873047, 'ref_logps/rejected': -106.50029754638672, 'ref_logps/chosen': -92.17179870605469, 'epoch': 0.03} 3%|▎ | 46/1336 [04:35<2:05:24, 5.83s/it] 4%|▎ | 47/1336 [04:41<2:04:42, 5.80s/it] {'loss': 0.6896, 'grad_norm': 73.30517094003568, 'learning_rate': 4.999735171296372e-07, 'losses/dpo': 0.6841588616371155, 'losses/sft': 1.128448724746704, 'losses/total': 0.6841588616371155, 'rewards/chosen': -0.010668843984603882, 'rewards/rejected': -0.018357107415795326, 'rewards/accuracies': 0.5625, 'rewards/margins': 
0.00768826249986887, 'logps/rejected': -87.70747375488281, 'logps/chosen': -89.39765930175781, 'ref_logps/rejected': -87.5239028930664, 'ref_logps/chosen': -89.29096984863281, 'epoch': 0.04} 4%|▎ | 47/1336 [04:41<2:04:42, 5.80s/it] 4%|▎ | 48/1336 [04:47<2:05:29, 5.85s/it] {'loss': 0.6921, 'grad_norm': 60.049199643228874, 'learning_rate': 4.999639541007116e-07, 'losses/dpo': 0.703823983669281, 'losses/sft': 0.7254171967506409, 'losses/total': 0.703823983669281, 'rewards/chosen': -0.014149850234389305, 'rewards/rejected': -0.016887009143829346, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.002737160073593259, 'logps/rejected': -74.22663879394531, 'logps/chosen': -70.51936340332031, 'ref_logps/rejected': -74.0577621459961, 'ref_logps/chosen': -70.37786102294922, 'epoch': 0.04} 4%|▎ | 48/1336 [04:47<2:05:29, 5.85s/it] 4%|▎ | 49/1336 [04:53<2:06:24, 5.89s/it] {'loss': 0.6899, 'grad_norm': 78.46158806898525, 'learning_rate': 4.999529199880923e-07, 'losses/dpo': 0.6808444857597351, 'losses/sft': 0.7124015688896179, 'losses/total': 0.6808444857597351, 'rewards/chosen': -0.018938392400741577, 'rewards/rejected': -0.026213916018605232, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.0072755212895572186, 'logps/rejected': -104.1199722290039, 'logps/chosen': -97.47354125976562, 'ref_logps/rejected': -103.85783386230469, 'ref_logps/chosen': -97.28414916992188, 'epoch': 0.04} 4%|▎ | 49/1336 [04:53<2:06:24, 5.89s/it] 4%|▎ | 50/1336 [04:59<2:05:30, 5.86s/it] {'loss': 0.6943, 'grad_norm': 54.676997038306276, 'learning_rate': 4.999404148567169e-07, 'losses/dpo': 0.696114182472229, 'losses/sft': 0.410934716463089, 'losses/total': 0.696114182472229, 'rewards/chosen': -0.016946006566286087, 'rewards/rejected': -0.01543875690549612, 'rewards/accuracies': 0.53125, 'rewards/margins': -0.0015072498936206102, 'logps/rejected': -80.48365783691406, 'logps/chosen': -74.20542907714844, 'ref_logps/rejected': -80.32927703857422, 'ref_logps/chosen': -74.03597259521484, 'epoch': 0.04} 
4%|▎ | 50/1336 [04:59<2:05:30, 5.86s/it] 4%|▍ | 51/1336 [05:04<2:03:22, 5.76s/it] {'loss': 0.6886, 'grad_norm': 77.51000847948441, 'learning_rate': 4.999264387801805e-07, 'losses/dpo': 0.693362832069397, 'losses/sft': 0.5452644228935242, 'losses/total': 0.693362832069397, 'rewards/chosen': -0.026029760017991066, 'rewards/rejected': -0.03662274777889252, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.010592987760901451, 'logps/rejected': -100.2743911743164, 'logps/chosen': -92.22940063476562, 'ref_logps/rejected': -99.90816497802734, 'ref_logps/chosen': -91.9690933227539, 'epoch': 0.04} 4%|▍ | 51/1336 [05:04<2:03:22, 5.76s/it] 4%|▍ | 52/1336 [05:10<2:05:05, 5.85s/it] {'loss': 0.6962, 'grad_norm': 50.71475709815057, 'learning_rate': 4.999109918407349e-07, 'losses/dpo': 0.6893306970596313, 'losses/sft': 0.8368744850158691, 'losses/total': 0.6893306970596313, 'rewards/chosen': -0.028336942195892334, 'rewards/rejected': -0.022948872298002243, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.005388069897890091, 'logps/rejected': -107.47657775878906, 'logps/chosen': -92.55049133300781, 'ref_logps/rejected': -107.24708557128906, 'ref_logps/chosen': -92.26712799072266, 'epoch': 0.04} 4%|▍ | 52/1336 [05:10<2:05:05, 5.85s/it] 4%|▍ | 53/1336 [05:16<2:05:35, 5.87s/it] {'loss': 0.694, 'grad_norm': 66.7611389295681, 'learning_rate': 4.99894074129288e-07, 'losses/dpo': 0.7079731822013855, 'losses/sft': 1.4070452451705933, 'losses/total': 0.7079731822013855, 'rewards/chosen': -0.022638272494077682, 'rewards/rejected': -0.022199522703886032, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.0004387493245303631, 'logps/rejected': -117.54583740234375, 'logps/chosen': -109.7945556640625, 'ref_logps/rejected': -117.32384490966797, 'ref_logps/chosen': -109.56817626953125, 'epoch': 0.04} 4%|▍ | 53/1336 [05:16<2:05:35, 5.87s/it] 4%|▍ | 54/1336 [05:22<2:06:35, 5.92s/it] {'loss': 0.6815, 'grad_norm': 75.39706881008834, 'learning_rate': 4.998756857454039e-07, 'losses/dpo': 
0.6876960396766663, 'losses/sft': 0.8659732937812805, 'losses/total': 0.6876960396766663, 'rewards/chosen': -0.014603697694838047, 'rewards/rejected': -0.03917011618614197, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.024566419422626495, 'logps/rejected': -114.71649169921875, 'logps/chosen': -88.1456298828125, 'ref_logps/rejected': -114.32479095458984, 'ref_logps/chosen': -87.99959564208984, 'epoch': 0.04} 4%|▍ | 54/1336 [05:22<2:06:35, 5.92s/it] 4%|▍ | 55/1336 [05:29<2:07:30, 5.97s/it] {'loss': 0.6906, 'grad_norm': 56.550691853683354, 'learning_rate': 4.998558267973013e-07, 'losses/dpo': 0.7019035220146179, 'losses/sft': 0.5660150051116943, 'losses/total': 0.7019035220146179, 'rewards/chosen': -0.024120014160871506, 'rewards/rejected': -0.030072549358010292, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.005952533334493637, 'logps/rejected': -101.69986724853516, 'logps/chosen': -97.3410415649414, 'ref_logps/rejected': -101.39913940429688, 'ref_logps/chosen': -97.09984588623047, 'epoch': 0.04} 4%|▍ | 55/1336 [05:29<2:07:30, 5.97s/it] 4%|▍ | 56/1336 [05:35<2:07:52, 5.99s/it] {'loss': 0.691, 'grad_norm': 59.33852573598695, 'learning_rate': 4.99834497401854e-07, 'losses/dpo': 0.6689102649688721, 'losses/sft': 1.0339840650558472, 'losses/total': 0.6689102649688721, 'rewards/chosen': -0.02982240915298462, 'rewards/rejected': -0.03469569608569145, 'rewards/accuracies': 0.5, 'rewards/margins': 0.004873292520642281, 'logps/rejected': -97.14228057861328, 'logps/chosen': -96.47964477539062, 'ref_logps/rejected': -96.79532623291016, 'ref_logps/chosen': -96.18141174316406, 'epoch': 0.04} 4%|▍ | 56/1336 [05:35<2:07:52, 5.99s/it] 4%|▍ | 57/1336 [05:41<2:07:17, 5.97s/it] {'loss': 0.6954, 'grad_norm': 65.99850009890046, 'learning_rate': 4.998116976845892e-07, 'losses/dpo': 0.7138208746910095, 'losses/sft': 0.4773041307926178, 'losses/total': 0.7138208746910095, 'rewards/chosen': -0.0328633077442646, 'rewards/rejected': -0.029025062918663025, 'rewards/accuracies': 
0.46875, 'rewards/margins': -0.003838244127109647, 'logps/rejected': -93.3443832397461, 'logps/chosen': -86.54103088378906, 'ref_logps/rejected': -93.05413055419922, 'ref_logps/chosen': -86.21240234375, 'epoch': 0.04} 4%|▍ | 57/1336 [05:41<2:07:17, 5.97s/it] 4%|▍ | 58/1336 [05:46<2:05:49, 5.91s/it] {'loss': 0.6972, 'grad_norm': 56.58801557282007, 'learning_rate': 4.997874277796877e-07, 'losses/dpo': 0.6957101821899414, 'losses/sft': 0.4586362838745117, 'losses/total': 0.6957101821899414, 'rewards/chosen': -0.029105044901371002, 'rewards/rejected': -0.021824222058057785, 'rewards/accuracies': 0.46875, 'rewards/margins': -0.007280820980668068, 'logps/rejected': -84.19448852539062, 'logps/chosen': -78.94558715820312, 'ref_logps/rejected': -83.97624969482422, 'ref_logps/chosen': -78.654541015625, 'epoch': 0.04} 4%|▍ | 58/1336 [05:46<2:05:49, 5.91s/it] 4%|▍ | 59/1336 [05:52<2:05:30, 5.90s/it] {'loss': 0.6923, 'grad_norm': 54.552973786300896, 'learning_rate': 4.997616878299821e-07, 'losses/dpo': 0.7016672492027283, 'losses/sft': 0.8859966397285461, 'losses/total': 0.7016672492027283, 'rewards/chosen': -0.030114391818642616, 'rewards/rejected': -0.032972536981105804, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.002858144696801901, 'logps/rejected': -99.25685119628906, 'logps/chosen': -101.95695495605469, 'ref_logps/rejected': -98.9271240234375, 'ref_logps/chosen': -101.65580749511719, 'epoch': 0.04} 4%|▍ | 59/1336 [05:52<2:05:30, 5.90s/it] 4%|▍ | 60/1336 [05:58<2:04:07, 5.84s/it] {'loss': 0.6908, 'grad_norm': 68.53947600445728, 'learning_rate': 4.997344779869566e-07, 'losses/dpo': 0.6947684288024902, 'losses/sft': 1.272596001625061, 'losses/total': 0.6947684288024902, 'rewards/chosen': -0.040184661746025085, 'rewards/rejected': -0.04585137963294983, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.0056667206808924675, 'logps/rejected': -117.95887756347656, 'logps/chosen': -111.48204040527344, 'ref_logps/rejected': -117.50035858154297, 'ref_logps/chosen': 
-111.0802001953125, 'epoch': 0.04} 4%|▍ | 60/1336 [05:58<2:04:07, 5.84s/it] 5%|▍ | 61/1336 [06:04<2:05:32, 5.91s/it] {'loss': 0.6889, 'grad_norm': 64.85035134413697, 'learning_rate': 4.997057984107465e-07, 'losses/dpo': 0.7077723145484924, 'losses/sft': 2.0163440704345703, 'losses/total': 0.7077723145484924, 'rewards/chosen': -0.022728189826011658, 'rewards/rejected': -0.03233005851507187, 'rewards/accuracies': 0.5, 'rewards/margins': 0.009601864032447338, 'logps/rejected': -94.69508361816406, 'logps/chosen': -94.10345458984375, 'ref_logps/rejected': -94.37178039550781, 'ref_logps/chosen': -93.87616729736328, 'epoch': 0.05} 5%|▍ | 61/1336 [06:04<2:05:32, 5.91s/it] 5%|▍ | 62/1336 [06:10<2:05:38, 5.92s/it] {'loss': 0.683, 'grad_norm': 57.033962103749666, 'learning_rate': 4.996756492701362e-07, 'losses/dpo': 0.6577162146568298, 'losses/sft': 1.2258440256118774, 'losses/total': 0.6577162146568298, 'rewards/chosen': -0.014091501012444496, 'rewards/rejected': -0.03517676889896393, 'rewards/accuracies': 0.625, 'rewards/margins': 0.021085266023874283, 'logps/rejected': -83.37911987304688, 'logps/chosen': -75.93788146972656, 'ref_logps/rejected': -83.02735137939453, 'ref_logps/chosen': -75.79696655273438, 'epoch': 0.05} 5%|▍ | 62/1336 [06:10<2:05:38, 5.92s/it] 5%|▍ | 63/1336 [06:16<2:04:44, 5.88s/it] {'loss': 0.6878, 'grad_norm': 64.88655149192442, 'learning_rate': 4.996440307425587e-07, 'losses/dpo': 0.6976509690284729, 'losses/sft': 1.2811416387557983, 'losses/total': 0.6976509690284729, 'rewards/chosen': -0.02749297395348549, 'rewards/rejected': -0.039089106023311615, 'rewards/accuracies': 0.625, 'rewards/margins': 0.011596133932471275, 'logps/rejected': -94.80174255371094, 'logps/chosen': -97.60345458984375, 'ref_logps/rejected': -94.41084289550781, 'ref_logps/chosen': -97.32852172851562, 'epoch': 0.05} 5%|▍ | 63/1336 [06:16<2:04:44, 5.88s/it] 5%|▍ | 64/1336 [06:21<2:04:26, 5.87s/it] {'loss': 0.6991, 'grad_norm': 48.343044875069346, 'learning_rate': 
4.996109430140952e-07, 'losses/dpo': 0.7026593685150146, 'losses/sft': 0.92555171251297, 'losses/total': 0.7026593685150146, 'rewards/chosen': -0.040680475533008575, 'rewards/rejected': -0.029866904020309448, 'rewards/accuracies': 0.375, 'rewards/margins': -0.010813570581376553, 'logps/rejected': -100.84979248046875, 'logps/chosen': -85.5622329711914, 'ref_logps/rejected': -100.55113220214844, 'ref_logps/chosen': -85.15542602539062, 'epoch': 0.05} 5%|▍ | 64/1336 [06:21<2:04:26, 5.87s/it] 5%|▍ | 65/1336 [06:27<2:03:42, 5.84s/it] {'loss': 0.6959, 'grad_norm': 87.15478593178592, 'learning_rate': 4.995763862794729e-07, 'losses/dpo': 0.6794412136077881, 'losses/sft': 0.8385435938835144, 'losses/total': 0.6794412136077881, 'rewards/chosen': -0.032080601900815964, 'rewards/rejected': -0.027716096490621567, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.00436450494453311, 'logps/rejected': -82.49763488769531, 'logps/chosen': -78.98066711425781, 'ref_logps/rejected': -82.22047424316406, 'ref_logps/chosen': -78.65986633300781, 'epoch': 0.05} 5%|▍ | 65/1336 [06:27<2:03:42, 5.84s/it] 5%|▍ | 66/1336 [06:33<2:04:33, 5.88s/it] {'loss': 0.6985, 'grad_norm': 57.92067147366268, 'learning_rate': 4.995403607420643e-07, 'losses/dpo': 0.6896036267280579, 'losses/sft': 1.0612897872924805, 'losses/total': 0.6896036267280579, 'rewards/chosen': -0.033859796822071075, 'rewards/rejected': -0.02475971169769764, 'rewards/accuracies': 0.375, 'rewards/margins': -0.009100079536437988, 'logps/rejected': -91.94143676757812, 'logps/chosen': -87.43882751464844, 'ref_logps/rejected': -91.69383239746094, 'ref_logps/chosen': -87.10023498535156, 'epoch': 0.05} 5%|▍ | 66/1336 [06:33<2:04:33, 5.88s/it] 5%|▌ | 67/1336 [06:39<2:01:27, 5.74s/it] {'loss': 0.6804, 'grad_norm': 49.188425251693666, 'learning_rate': 4.995028666138866e-07, 'losses/dpo': 0.6929285526275635, 'losses/sft': 0.7129292488098145, 'losses/total': 0.6929285526275635, 'rewards/chosen': -0.027302134782075882, 'rewards/rejected': 
-0.05404958128929138, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.0267474465072155, 'logps/rejected': -91.25230407714844, 'logps/chosen': -93.0034408569336, 'ref_logps/rejected': -90.71182250976562, 'ref_logps/chosen': -92.73042297363281, 'epoch': 0.05} 5%|▌ | 67/1336 [06:39<2:01:27, 5.74s/it] 5%|▌ | 68/1336 [06:45<2:04:20, 5.88s/it] {'loss': 0.6855, 'grad_norm': 47.48055633323434, 'learning_rate': 4.994639041155993e-07, 'losses/dpo': 0.6678017377853394, 'losses/sft': 0.9797214269638062, 'losses/total': 0.6678017377853394, 'rewards/chosen': -0.025678418576717377, 'rewards/rejected': -0.042121514678001404, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.016443094238638878, 'logps/rejected': -100.600341796875, 'logps/chosen': -88.83892822265625, 'ref_logps/rejected': -100.17912292480469, 'ref_logps/chosen': -88.58214569091797, 'epoch': 0.05} 5%|▌ | 68/1336 [06:45<2:04:20, 5.88s/it] 5%|▌ | 69/1336 [06:51<2:04:42, 5.91s/it] {'loss': 0.6832, 'grad_norm': 56.23986352081886, 'learning_rate': 4.994234734765043e-07, 'losses/dpo': 0.6898762583732605, 'losses/sft': 1.0205817222595215, 'losses/total': 0.6898762583732605, 'rewards/chosen': -0.031344104558229446, 'rewards/rejected': -0.05207264423370361, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.02072853595018387, 'logps/rejected': -94.43159484863281, 'logps/chosen': -88.33454895019531, 'ref_logps/rejected': -93.9108657836914, 'ref_logps/chosen': -88.02110290527344, 'epoch': 0.05} 5%|▌ | 69/1336 [06:51<2:04:42, 5.91s/it] 5%|▌ | 70/1336 [06:57<2:03:51, 5.87s/it] {'loss': 0.6909, 'grad_norm': 65.99786463086318, 'learning_rate': 4.993815749345429e-07, 'losses/dpo': 0.698851466178894, 'losses/sft': 0.19171610474586487, 'losses/total': 0.698851466178894, 'rewards/chosen': -0.03615737706422806, 'rewards/rejected': -0.04104440659284592, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.004887028597295284, 'logps/rejected': -87.66299438476562, 'logps/chosen': -80.23670959472656, 'ref_logps/rejected': 
-87.2525405883789, 'ref_logps/chosen': -79.87513732910156, 'epoch': 0.05} 5%|▌ | 70/1336 [06:57<2:03:51, 5.87s/it] 5%|▌ | 71/1336 [07:03<2:05:27, 5.95s/it] {'loss': 0.6869, 'grad_norm': 75.7823031448648, 'learning_rate': 4.993382087362959e-07, 'losses/dpo': 0.6713548302650452, 'losses/sft': 0.665778636932373, 'losses/total': 0.6713548302650452, 'rewards/chosen': -0.05774197727441788, 'rewards/rejected': -0.07147706300020218, 'rewards/accuracies': 0.5, 'rewards/margins': 0.013735083863139153, 'logps/rejected': -106.31314849853516, 'logps/chosen': -100.08606719970703, 'ref_logps/rejected': -105.598388671875, 'ref_logps/chosen': -99.5086441040039, 'epoch': 0.05} 5%|▌ | 71/1336 [07:03<2:05:27, 5.95s/it] 5%|▌ | 72/1336 [07:08<2:03:42, 5.87s/it] {'loss': 0.6826, 'grad_norm': 67.40288316350006, 'learning_rate': 4.992933751369812e-07, 'losses/dpo': 0.7096676826477051, 'losses/sft': 0.6441861987113953, 'losses/total': 0.7096676826477051, 'rewards/chosen': -0.027971874922513962, 'rewards/rejected': -0.05053577572107315, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.02256390079855919, 'logps/rejected': -78.16893005371094, 'logps/chosen': -77.14959716796875, 'ref_logps/rejected': -77.66357421875, 'ref_logps/chosen': -76.869873046875, 'epoch': 0.05} 5%|▌ | 72/1336 [07:08<2:03:42, 5.87s/it] 5%|▌ | 73/1336 [07:14<2:02:55, 5.84s/it] {'loss': 0.6906, 'grad_norm': 75.67512765148375, 'learning_rate': 4.99247074400453e-07, 'losses/dpo': 0.6768147349357605, 'losses/sft': 0.8260210752487183, 'losses/total': 0.6768147349357605, 'rewards/chosen': -0.027934866026043892, 'rewards/rejected': -0.03370640054345131, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.00577153405174613, 'logps/rejected': -73.71775817871094, 'logps/chosen': -66.48274993896484, 'ref_logps/rejected': -73.38069915771484, 'ref_logps/chosen': -66.20339965820312, 'epoch': 0.05} 5%|▌ | 73/1336 [07:14<2:02:55, 5.84s/it] 6%|▌ | 74/1336 [07:20<2:01:02, 5.75s/it] {'loss': 0.6873, 'grad_norm': 62.50572535421763, 
'learning_rate': 4.991993067991995e-07, 'losses/dpo': 0.7113600969314575, 'losses/sft': 0.6777662038803101, 'losses/total': 0.7113600969314575, 'rewards/chosen': -0.04953800141811371, 'rewards/rejected': -0.06298890709877014, 'rewards/accuracies': 0.625, 'rewards/margins': 0.013450901955366135, 'logps/rejected': -94.92948913574219, 'logps/chosen': -85.81665802001953, 'ref_logps/rejected': -94.29959106445312, 'ref_logps/chosen': -85.32127380371094, 'epoch': 0.06} 6%|▌ | 74/1336 [07:20<2:01:02, 5.75s/it] 6%|▌ | 75/1336 [07:26<2:01:00, 5.76s/it] {'loss': 0.6928, 'grad_norm': 70.81936652619041, 'learning_rate': 4.991500726143415e-07, 'losses/dpo': 0.7131880521774292, 'losses/sft': 0.8783124685287476, 'losses/total': 0.7131880521774292, 'rewards/chosen': -0.06070011854171753, 'rewards/rejected': -0.06255584955215454, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.00185573217459023, 'logps/rejected': -91.65453338623047, 'logps/chosen': -94.44608306884766, 'ref_logps/rejected': -91.02897644042969, 'ref_logps/chosen': -93.83908081054688, 'epoch': 0.06} 6%|▌ | 75/1336 [07:26<2:01:00, 5.76s/it] 6%|▌ | 76/1336 [07:31<2:00:26, 5.74s/it] {'loss': 0.6876, 'grad_norm': 51.28045352407927, 'learning_rate': 4.990993721356315e-07, 'losses/dpo': 0.6730806231498718, 'losses/sft': 0.5774669647216797, 'losses/total': 0.6730806231498718, 'rewards/chosen': -0.03936201333999634, 'rewards/rejected': -0.051786281168460846, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.012424267828464508, 'logps/rejected': -87.84396362304688, 'logps/chosen': -74.62962341308594, 'ref_logps/rejected': -87.32610321044922, 'ref_logps/chosen': -74.23600006103516, 'epoch': 0.06} 6%|▌ | 76/1336 [07:31<2:00:26, 5.74s/it] 6%|▌ | 77/1336 [07:37<2:01:58, 5.81s/it] {'loss': 0.6956, 'grad_norm': 73.38404894371651, 'learning_rate': 4.990472056614512e-07, 'losses/dpo': 0.6743743419647217, 'losses/sft': 0.6503850817680359, 'losses/total': 0.6743743419647217, 'rewards/chosen': -0.059629812836647034, 
'rewards/rejected': -0.05708999186754227, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0025398200377821922, 'logps/rejected': -99.65370178222656, 'logps/chosen': -107.91146850585938, 'ref_logps/rejected': -99.08279418945312, 'ref_logps/chosen': -107.31517028808594, 'epoch': 0.06} 6%|▌ | 77/1336 [07:37<2:01:58, 5.81s/it] 6%|▌ | 78/1336 [07:43<2:00:48, 5.76s/it] {'loss': 0.6964, 'grad_norm': 64.76811726264836, 'learning_rate': 4.989935734988097e-07, 'losses/dpo': 0.7121404409408569, 'losses/sft': 1.1390495300292969, 'losses/total': 0.7121404409408569, 'rewards/chosen': -0.07461907714605331, 'rewards/rejected': -0.06969718635082245, 'rewards/accuracies': 0.59375, 'rewards/margins': -0.004921893123537302, 'logps/rejected': -116.91382598876953, 'logps/chosen': -111.61673736572266, 'ref_logps/rejected': -116.21685791015625, 'ref_logps/chosen': -110.87055206298828, 'epoch': 0.06} 6%|▌ | 78/1336 [07:43<2:00:48, 5.76s/it] 6%|▌ | 79/1336 [07:49<2:00:59, 5.77s/it] {'loss': 0.6858, 'grad_norm': 56.33700245226397, 'learning_rate': 4.989384759633421e-07, 'losses/dpo': 0.6977310180664062, 'losses/sft': 0.12895165383815765, 'losses/total': 0.6977310180664062, 'rewards/chosen': -0.03803490102291107, 'rewards/rejected': -0.053698815405368805, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.015663912519812584, 'logps/rejected': -78.37271118164062, 'logps/chosen': -75.70103454589844, 'ref_logps/rejected': -77.83572387695312, 'ref_logps/chosen': -75.32069396972656, 'epoch': 0.06} 6%|▌ | 79/1336 [07:49<2:00:59, 5.77s/it] 6%|▌ | 80/1336 [07:54<2:01:00, 5.78s/it] {'loss': 0.6844, 'grad_norm': 63.322037591326826, 'learning_rate': 4.988819133793076e-07, 'losses/dpo': 0.666938066482544, 'losses/sft': 0.6361566781997681, 'losses/total': 0.666938066482544, 'rewards/chosen': -0.05001307278871536, 'rewards/rejected': -0.0688231885433197, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.01881011575460434, 'logps/rejected': -96.76344299316406, 'logps/chosen': -88.87350463867188, 
'ref_logps/rejected': -96.07521057128906, 'ref_logps/chosen': -88.37337493896484, 'epoch': 0.06} 6%|▌ | 80/1336 [07:54<2:01:00, 5.78s/it] 6%|▌ | 81/1336 [08:00<2:02:12, 5.84s/it] {'loss': 0.6818, 'grad_norm': 66.7902240212652, 'learning_rate': 4.988238860795872e-07, 'losses/dpo': 0.6747066974639893, 'losses/sft': 1.0494171380996704, 'losses/total': 0.6747066974639893, 'rewards/chosen': -0.05232900381088257, 'rewards/rejected': -0.07713000476360321, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.024800993502140045, 'logps/rejected': -108.00592041015625, 'logps/chosen': -91.89390563964844, 'ref_logps/rejected': -107.23461151123047, 'ref_logps/chosen': -91.37062072753906, 'epoch': 0.06} 6%|▌ | 81/1336 [08:00<2:02:12, 5.84s/it] 6%|▌ | 82/1336 [08:07<2:04:17, 5.95s/it] {'loss': 0.6841, 'grad_norm': 62.57921513478238, 'learning_rate': 4.987643944056824e-07, 'losses/dpo': 0.670839250087738, 'losses/sft': 0.6522030830383301, 'losses/total': 0.670839250087738, 'rewards/chosen': -0.04761672019958496, 'rewards/rejected': -0.06746143102645874, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.019844714552164078, 'logps/rejected': -105.08617401123047, 'logps/chosen': -96.8008041381836, 'ref_logps/rejected': -104.41156005859375, 'ref_logps/chosen': -96.32463073730469, 'epoch': 0.06} 6%|▌ | 82/1336 [08:07<2:04:17, 5.95s/it] 6%|▌ | 83/1336 [08:13<2:03:46, 5.93s/it] {'loss': 0.6731, 'grad_norm': 59.90539992738723, 'learning_rate': 4.987034387077125e-07, 'losses/dpo': 0.6596886515617371, 'losses/sft': 0.920014500617981, 'losses/total': 0.6596886515617371, 'rewards/chosen': -0.05371224135160446, 'rewards/rejected': -0.09605351090431213, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.04234126955270767, 'logps/rejected': -126.1627426147461, 'logps/chosen': -108.28489685058594, 'ref_logps/rejected': -125.20220184326172, 'ref_logps/chosen': -107.74777221679688, 'epoch': 0.06} 6%|▌ | 83/1336 [08:13<2:03:46, 5.93s/it] 6%|▋ | 84/1336 [08:18<2:01:41, 5.83s/it] {'loss': 0.677, 
'grad_norm': 81.97272428592007, 'learning_rate': 4.98641019344413e-07, 'losses/dpo': 0.6500980854034424, 'losses/sft': 1.059922695159912, 'losses/total': 0.6500980854034424, 'rewards/chosen': -0.057776156812906265, 'rewards/rejected': -0.09169187396764755, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.03391571715474129, 'logps/rejected': -118.4371566772461, 'logps/chosen': -98.74504089355469, 'ref_logps/rejected': -117.5202407836914, 'ref_logps/chosen': -98.16727447509766, 'epoch': 0.06} 6%|▋ | 84/1336 [08:18<2:01:41, 5.83s/it] 6%|▋ | 85/1336 [08:24<2:00:42, 5.79s/it] {'loss': 0.6682, 'grad_norm': 78.52467796553555, 'learning_rate': 4.985771366831332e-07, 'losses/dpo': 0.6582998037338257, 'losses/sft': 1.0659692287445068, 'losses/total': 0.6582998037338257, 'rewards/chosen': -0.029097914695739746, 'rewards/rejected': -0.08158441632986069, 'rewards/accuracies': 0.75, 'rewards/margins': 0.052486494183540344, 'logps/rejected': -88.36016082763672, 'logps/chosen': -74.18081665039062, 'ref_logps/rejected': -87.5443115234375, 'ref_logps/chosen': -73.88983154296875, 'epoch': 0.06} 6%|▋ | 85/1336 [08:24<2:00:42, 5.79s/it] 6%|▋ | 86/1336 [08:30<2:00:33, 5.79s/it] {'loss': 0.6981, 'grad_norm': 72.80380390939733, 'learning_rate': 4.985117910998344e-07, 'losses/dpo': 0.6858033537864685, 'losses/sft': 0.7866793870925903, 'losses/total': 0.6858033537864685, 'rewards/chosen': -0.048946887254714966, 'rewards/rejected': -0.039768025279045105, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.009178864769637585, 'logps/rejected': -64.92509460449219, 'logps/chosen': -66.82148742675781, 'ref_logps/rejected': -64.52741241455078, 'ref_logps/chosen': -66.33202362060547, 'epoch': 0.06} 6%|▋ | 86/1336 [08:30<2:00:33, 5.79s/it] 7%|▋ | 87/1336 [08:35<1:59:00, 5.72s/it] {'loss': 0.6876, 'grad_norm': 65.705312170395, 'learning_rate': 4.984449829790873e-07, 'losses/dpo': 0.6801273822784424, 'losses/sft': 1.283775806427002, 'losses/total': 0.6801273822784424, 'rewards/chosen': 
-0.06537526845932007, 'rewards/rejected': -0.07831262052059174, 'rewards/accuracies': 0.4375, 'rewards/margins': 0.012937350198626518, 'logps/rejected': -101.1171646118164, 'logps/chosen': -95.60189056396484, 'ref_logps/rejected': -100.33403778076172, 'ref_logps/chosen': -94.94813537597656, 'epoch': 0.07} 7%|▋ | 87/1336 [08:35<1:59:00, 5.72s/it] 7%|▋ | 88/1336 [08:41<1:59:49, 5.76s/it] {'loss': 0.6779, 'grad_norm': 66.84256392715615, 'learning_rate': 4.983767127140698e-07, 'losses/dpo': 0.6488788723945618, 'losses/sft': 0.9031060934066772, 'losses/total': 0.6488788723945618, 'rewards/chosen': -0.0595526359975338, 'rewards/rejected': -0.09346791356801987, 'rewards/accuracies': 0.625, 'rewards/margins': 0.03391527757048607, 'logps/rejected': -111.51509857177734, 'logps/chosen': -98.46894836425781, 'ref_logps/rejected': -110.58041381835938, 'ref_logps/chosen': -97.87342071533203, 'epoch': 0.07} 7%|▋ | 88/1336 [08:41<1:59:49, 5.76s/it] 7%|▋ | 89/1336 [08:47<2:00:06, 5.78s/it] {'loss': 0.6985, 'grad_norm': 71.87589471674775, 'learning_rate': 4.983069807065651e-07, 'losses/dpo': 0.6811867356300354, 'losses/sft': 0.5975565314292908, 'losses/total': 0.6811867356300354, 'rewards/chosen': -0.06895739585161209, 'rewards/rejected': -0.06065075099468231, 'rewards/accuracies': 0.5, 'rewards/margins': -0.00830664299428463, 'logps/rejected': -101.11985778808594, 'logps/chosen': -102.52367401123047, 'ref_logps/rejected': -100.51335144042969, 'ref_logps/chosen': -101.83409118652344, 'epoch': 0.07} 7%|▋ | 89/1336 [08:47<2:00:06, 5.78s/it] 7%|▋ | 90/1336 [08:53<2:01:49, 5.87s/it] {'loss': 0.6859, 'grad_norm': 60.313333614300596, 'learning_rate': 4.982357873669588e-07, 'losses/dpo': 0.6559918522834778, 'losses/sft': 0.692446231842041, 'losses/total': 0.6559918522834778, 'rewards/chosen': -0.05094433203339577, 'rewards/rejected': -0.06794217228889465, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.016997840255498886, 'logps/rejected': -100.54303741455078, 'logps/chosen': 
-95.91080474853516, 'ref_logps/rejected': -99.86360931396484, 'ref_logps/chosen': -95.4013671875, 'epoch': 0.07} 7%|▋ | 90/1336 [08:53<2:01:49, 5.87s/it] 7%|▋ | 91/1336 [08:59<2:00:17, 5.80s/it] {'loss': 0.6943, 'grad_norm': 69.83363089425991, 'learning_rate': 4.981631331142367e-07, 'losses/dpo': 0.6911967396736145, 'losses/sft': 0.7096028923988342, 'losses/total': 0.6911967396736145, 'rewards/chosen': -0.0705660730600357, 'rewards/rejected': -0.07023688405752182, 'rewards/accuracies': 0.40625, 'rewards/margins': -0.00032919086515903473, 'logps/rejected': -88.71197509765625, 'logps/chosen': -86.53044891357422, 'ref_logps/rejected': -88.00960540771484, 'ref_logps/chosen': -85.82479095458984, 'epoch': 0.07} 7%|▋ | 91/1336 [08:59<2:00:17, 5.80s/it] 7%|▋ | 92/1336 [09:05<2:03:06, 5.94s/it] {'loss': 0.6762, 'grad_norm': 50.47320602161268, 'learning_rate': 4.980890183759825e-07, 'losses/dpo': 0.7062295079231262, 'losses/sft': 0.28434550762176514, 'losses/total': 0.7062295079231262, 'rewards/chosen': -0.04204845428466797, 'rewards/rejected': -0.07821141928434372, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.03616296499967575, 'logps/rejected': -91.15997314453125, 'logps/chosen': -73.81181335449219, 'ref_logps/rejected': -90.37786865234375, 'ref_logps/chosen': -73.39132690429688, 'epoch': 0.07} 7%|▋ | 92/1336 [09:05<2:03:06, 5.94s/it] 7%|▋ | 93/1336 [09:10<2:01:24, 5.86s/it] {'loss': 0.6774, 'grad_norm': 54.73014552539818, 'learning_rate': 4.980134435883749e-07, 'losses/dpo': 0.6859480142593384, 'losses/sft': 0.791650116443634, 'losses/total': 0.6859480142593384, 'rewards/chosen': -0.06241985782980919, 'rewards/rejected': -0.0961245745420456, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.0337047204375267, 'logps/rejected': -100.53861999511719, 'logps/chosen': -95.4791488647461, 'ref_logps/rejected': -99.5773696899414, 'ref_logps/chosen': -94.8549575805664, 'epoch': 0.07} 7%|▋ | 93/1336 [09:10<2:01:24, 5.86s/it] 7%|▋ | 94/1336 [09:16<1:59:02, 5.75s/it] {'loss': 
0.6773, 'grad_norm': 56.3163679955879, 'learning_rate': 4.979364091961855e-07, 'losses/dpo': 0.6644335389137268, 'losses/sft': 1.045114517211914, 'losses/total': 0.6644335389137268, 'rewards/chosen': -0.05425192043185234, 'rewards/rejected': -0.08798475563526154, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.033732835203409195, 'logps/rejected': -93.63600158691406, 'logps/chosen': -78.08065795898438, 'ref_logps/rejected': -92.75614929199219, 'ref_logps/chosen': -77.53814697265625, 'epoch': 0.07} 7%|▋ | 94/1336 [09:16<1:59:02, 5.75s/it] 7%|▋ | 95/1336 [09:21<1:57:32, 5.68s/it] {'loss': 0.6786, 'grad_norm': 185.0654204298231, 'learning_rate': 4.978579156527758e-07, 'losses/dpo': 0.7137663960456848, 'losses/sft': 1.143051266670227, 'losses/total': 0.7137663960456848, 'rewards/chosen': -0.049528833478689194, 'rewards/rejected': -0.08082681894302368, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.03129799664020538, 'logps/rejected': -92.92390441894531, 'logps/chosen': -82.78665924072266, 'ref_logps/rejected': -92.11564636230469, 'ref_logps/chosen': -82.29136657714844, 'epoch': 0.07} 7%|▋ | 95/1336 [09:22<1:57:32, 5.68s/it] 7%|▋ | 96/1336 [09:27<1:57:39, 5.69s/it] {'loss': 0.6828, 'grad_norm': 61.95656223839921, 'learning_rate': 4.977779634200946e-07, 'losses/dpo': 0.6933457255363464, 'losses/sft': 0.322223961353302, 'losses/total': 0.6933457255363464, 'rewards/chosen': -0.08600245416164398, 'rewards/rejected': -0.10936033725738525, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.023357883095741272, 'logps/rejected': -115.40925598144531, 'logps/chosen': -106.30671691894531, 'ref_logps/rejected': -114.3156509399414, 'ref_logps/chosen': -105.44669342041016, 'epoch': 0.07} 7%|▋ | 96/1336 [09:27<1:57:39, 5.69s/it] 7%|▋ | 97/1336 [09:33<1:56:24, 5.64s/it] {'loss': 0.6701, 'grad_norm': 55.377156820552614, 'learning_rate': 4.976965529686756e-07, 'losses/dpo': 0.6279661655426025, 'losses/sft': 1.0962822437286377, 'losses/total': 0.6279661655426025, 
'rewards/chosen': -0.06379528343677521, 'rewards/rejected': -0.11491156369447708, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.051116280257701874, 'logps/rejected': -99.21058654785156, 'logps/chosen': -87.31617736816406, 'ref_logps/rejected': -98.06147766113281, 'ref_logps/chosen': -86.67822265625, 'epoch': 0.07} 7%|▋ | 97/1336 [09:33<1:56:24, 5.64s/it] 7%|▋ | 98/1336 [09:38<1:56:01, 5.62s/it] {'loss': 0.6817, 'grad_norm': 67.79994549538101, 'learning_rate': 4.976136847776338e-07, 'losses/dpo': 0.6786997318267822, 'losses/sft': 0.7206512093544006, 'losses/total': 0.6786997318267822, 'rewards/chosen': -0.08322081714868546, 'rewards/rejected': -0.10877473652362823, 'rewards/accuracies': 0.625, 'rewards/margins': 0.025553930550813675, 'logps/rejected': -110.78470611572266, 'logps/chosen': -103.6918716430664, 'ref_logps/rejected': -109.69694519042969, 'ref_logps/chosen': -102.85965728759766, 'epoch': 0.07} 7%|▋ | 98/1336 [09:38<1:56:01, 5.62s/it] 7%|▋ | 99/1336 [09:44<1:55:06, 5.58s/it] {'loss': 0.6739, 'grad_norm': 86.12876250487602, 'learning_rate': 4.975293593346643e-07, 'losses/dpo': 0.691582202911377, 'losses/sft': 1.7708208560943604, 'losses/total': 0.691582202911377, 'rewards/chosen': -0.1080462783575058, 'rewards/rejected': -0.1500636786222458, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.04201740771532059, 'logps/rejected': -127.66023254394531, 'logps/chosen': -130.41087341308594, 'ref_logps/rejected': -126.15959167480469, 'ref_logps/chosen': -129.33041381835938, 'epoch': 0.07} 7%|▋ | 99/1336 [09:44<1:55:06, 5.58s/it] 7%|▋ | 100/1336 [09:50<1:56:11, 5.64s/it] {'loss': 0.6889, 'grad_norm': 74.91690569518771, 'learning_rate': 4.974435771360376e-07, 'losses/dpo': 0.7068099975585938, 'losses/sft': 0.5260930061340332, 'losses/total': 0.7068099975585938, 'rewards/chosen': -0.07776672393083572, 'rewards/rejected': -0.08833464235067368, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.010567913763225079, 'logps/rejected': -74.41163635253906, 
'logps/chosen': -74.93022155761719, 'ref_logps/rejected': -73.52828979492188, 'ref_logps/chosen': -74.15255737304688, 'epoch': 0.07} 7%|▋ | 100/1336 [09:50<1:56:11, 5.64s/it] 8%|▊ | 101/1336 [09:56<1:57:56, 5.73s/it] {'loss': 0.6599, 'grad_norm': 64.05353463311741, 'learning_rate': 4.973563386865974e-07, 'losses/dpo': 0.6590644717216492, 'losses/sft': 0.6268046498298645, 'losses/total': 0.6590644717216492, 'rewards/chosen': -0.05056170001626015, 'rewards/rejected': -0.12101025134325027, 'rewards/accuracies': 0.75, 'rewards/margins': 0.07044855505228043, 'logps/rejected': -97.04532623291016, 'logps/chosen': -80.50483703613281, 'ref_logps/rejected': -95.83522033691406, 'ref_logps/chosen': -79.99922180175781, 'epoch': 0.08} 8%|▊ | 101/1336 [09:56<1:57:56, 5.73s/it] 8%|▊ | 102/1336 [10:01<1:57:32, 5.72s/it] {'loss': 0.6863, 'grad_norm': 71.56214143754842, 'learning_rate': 4.972676444997583e-07, 'losses/dpo': 0.7365929484367371, 'losses/sft': 0.485510915517807, 'losses/total': 0.7365929484367371, 'rewards/chosen': -0.09043414890766144, 'rewards/rejected': -0.10843917727470398, 'rewards/accuracies': 0.5, 'rewards/margins': 0.01800503209233284, 'logps/rejected': -80.65179443359375, 'logps/chosen': -72.7188949584961, 'ref_logps/rejected': -79.56739807128906, 'ref_logps/chosen': -71.81455993652344, 'epoch': 0.08} 8%|▊ | 102/1336 [10:01<1:57:32, 5.72s/it] 8%|▊ | 103/1336 [10:07<1:58:42, 5.78s/it] {'loss': 0.6683, 'grad_norm': 61.33439822240008, 'learning_rate': 4.971774950975015e-07, 'losses/dpo': 0.6355347633361816, 'losses/sft': 0.7201902270317078, 'losses/total': 0.6355347633361816, 'rewards/chosen': -0.07107002288103104, 'rewards/rejected': -0.12439003586769104, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.053320012986660004, 'logps/rejected': -110.69197845458984, 'logps/chosen': -103.31317901611328, 'ref_logps/rejected': -109.44808197021484, 'ref_logps/chosen': -102.60247802734375, 'epoch': 0.08} 8%|▊ | 103/1336 [10:07<1:58:42, 5.78s/it] 8%|▊ | 104/1336 
[10:13<1:57:34, 5.73s/it] {'loss': 0.6713, 'grad_norm': 79.43470565959389, 'learning_rate': 4.97085891010373e-07, 'losses/dpo': 0.6458479762077332, 'losses/sft': 0.4618457853794098, 'losses/total': 0.6458479762077332, 'rewards/chosen': -0.10055407881736755, 'rewards/rejected': -0.1489516794681549, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.04839760810136795, 'logps/rejected': -123.46762084960938, 'logps/chosen': -109.30094909667969, 'ref_logps/rejected': -121.97810363769531, 'ref_logps/chosen': -108.29540252685547, 'epoch': 0.08} 8%|▊ | 104/1336 [10:13<1:57:34, 5.73s/it] 8%|▊ | 105/1336 [10:19<1:59:22, 5.82s/it] {'loss': 0.6715, 'grad_norm': 100.11676456333804, 'learning_rate': 4.969928327774797e-07, 'losses/dpo': 0.7077957391738892, 'losses/sft': 0.29774361848831177, 'losses/total': 0.7077957391738892, 'rewards/chosen': -0.07914724946022034, 'rewards/rejected': -0.12719115614891052, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.048043906688690186, 'logps/rejected': -94.40913391113281, 'logps/chosen': -88.98631286621094, 'ref_logps/rejected': -93.13722229003906, 'ref_logps/chosen': -88.1948471069336, 'epoch': 0.08} 8%|▊ | 105/1336 [10:19<1:59:22, 5.82s/it] 8%|▊ | 106/1336 [10:24<1:56:54, 5.70s/it] {'loss': 0.6673, 'grad_norm': 86.83304132667494, 'learning_rate': 4.968983209464862e-07, 'losses/dpo': 0.6833786368370056, 'losses/sft': 0.8973764181137085, 'losses/total': 0.6833786368370056, 'rewards/chosen': -0.07352153211832047, 'rewards/rejected': -0.12968051433563232, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.05615898221731186, 'logps/rejected': -89.24547576904297, 'logps/chosen': -80.76592254638672, 'ref_logps/rejected': -87.94866943359375, 'ref_logps/chosen': -80.03070068359375, 'epoch': 0.08} 8%|▊ | 106/1336 [10:24<1:56:54, 5.70s/it] 8%|▊ | 107/1336 [10:30<1:57:16, 5.73s/it] {'loss': 0.6927, 'grad_norm': 53.67490453262972, 'learning_rate': 4.968023560736121e-07, 'losses/dpo': 0.6981375813484192, 'losses/sft': 0.22609184682369232, 
'losses/total': 0.6981375813484192, 'rewards/chosen': -0.10389627516269684, 'rewards/rejected': -0.10819172859191895, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.004295450169593096, 'logps/rejected': -93.75114440917969, 'logps/chosen': -82.72154235839844, 'ref_logps/rejected': -92.66921997070312, 'ref_logps/chosen': -81.68257904052734, 'epoch': 0.08} 8%|▊ | 107/1336 [10:30<1:57:16, 5.73s/it] 8%|▊ | 108/1336 [10:35<1:55:54, 5.66s/it] {'loss': 0.6914, 'grad_norm': 76.67300628766606, 'learning_rate': 4.96704938723628e-07, 'losses/dpo': 0.7207042574882507, 'losses/sft': 0.8529693484306335, 'losses/total': 0.7207042574882507, 'rewards/chosen': -0.12965452671051025, 'rewards/rejected': -0.1353469043970108, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.005692359525710344, 'logps/rejected': -108.7225341796875, 'logps/chosen': -102.84796905517578, 'ref_logps/rejected': -107.36907196044922, 'ref_logps/chosen': -101.55142211914062, 'epoch': 0.08} 8%|▊ | 108/1336 [10:35<1:55:54, 5.66s/it] 8%|▊ | 109/1336 [10:42<1:58:26, 5.79s/it] {'loss': 0.6777, 'grad_norm': 100.12104696216377, 'learning_rate': 4.966060694698532e-07, 'losses/dpo': 0.6511812210083008, 'losses/sft': 0.7598628997802734, 'losses/total': 0.6511812210083008, 'rewards/chosen': -0.10896223783493042, 'rewards/rejected': -0.1420867145061493, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.03312448039650917, 'logps/rejected': -95.9395980834961, 'logps/chosen': -91.55227661132812, 'ref_logps/rejected': -94.51873779296875, 'ref_logps/chosen': -90.46266174316406, 'epoch': 0.08} 8%|▊ | 109/1336 [10:42<1:58:26, 5.79s/it] 8%|▊ | 110/1336 [10:47<1:56:13, 5.69s/it] {'loss': 0.6949, 'grad_norm': 48.2512698624084, 'learning_rate': 4.965057488941513e-07, 'losses/dpo': 0.6899063587188721, 'losses/sft': 0.9825999140739441, 'losses/total': 0.6899063587188721, 'rewards/chosen': -0.08496502786874771, 'rewards/rejected': -0.08426479250192642, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.0007002444472163916, 
'logps/rejected': -82.44159698486328, 'logps/chosen': -82.87835693359375, 'ref_logps/rejected': -81.59894561767578, 'ref_logps/chosen': -82.02870178222656, 'epoch': 0.08} 8%|▊ | 110/1336 [10:47<1:56:13, 5.69s/it] 8%|▊ | 111/1336 [10:53<1:58:07, 5.79s/it] {'loss': 0.6697, 'grad_norm': 53.40237280542307, 'learning_rate': 4.964039775869271e-07, 'losses/dpo': 0.7204440832138062, 'losses/sft': 1.3529318571090698, 'losses/total': 0.7204440832138062, 'rewards/chosen': -0.12340886890888214, 'rewards/rejected': -0.1764829009771347, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.053074028342962265, 'logps/rejected': -107.21756744384766, 'logps/chosen': -92.53004455566406, 'ref_logps/rejected': -105.45274353027344, 'ref_logps/chosen': -91.29595947265625, 'epoch': 0.08} 8%|▊ | 111/1336 [10:53<1:58:07, 5.79s/it] 8%|▊ | 112/1336 [10:59<1:58:18, 5.80s/it] {'loss': 0.6724, 'grad_norm': 87.60371199106952, 'learning_rate': 4.963007561471235e-07, 'losses/dpo': 0.6549930572509766, 'losses/sft': 0.501497209072113, 'losses/total': 0.6549930572509766, 'rewards/chosen': -0.11795705556869507, 'rewards/rejected': -0.1660975217819214, 'rewards/accuracies': 0.625, 'rewards/margins': 0.04814044386148453, 'logps/rejected': -107.78184509277344, 'logps/chosen': -104.42776489257812, 'ref_logps/rejected': -106.12086486816406, 'ref_logps/chosen': -103.24819946289062, 'epoch': 0.08} 8%|▊ | 112/1336 [10:59<1:58:18, 5.80s/it] 8%|▊ | 113/1336 [11:05<1:57:39, 5.77s/it] {'loss': 0.6659, 'grad_norm': 60.45216240687134, 'learning_rate': 4.961960851822176e-07, 'losses/dpo': 0.624608039855957, 'losses/sft': 0.6458165049552917, 'losses/total': 0.624608039855957, 'rewards/chosen': -0.12265896797180176, 'rewards/rejected': -0.1835458129644394, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.06088685244321823, 'logps/rejected': -104.56695556640625, 'logps/chosen': -87.74386596679688, 'ref_logps/rejected': -102.73149108886719, 'ref_logps/chosen': -86.51728057861328, 'epoch': 0.08} 8%|▊ | 113/1336 
[11:05<1:57:39, 5.77s/it] 9%|▊ | 114/1336 [11:11<1:58:40, 5.83s/it] {'loss': 0.6876, 'grad_norm': 166.19651344764523, 'learning_rate': 4.960899653082173e-07, 'losses/dpo': 0.6737421751022339, 'losses/sft': 0.8412036895751953, 'losses/total': 0.6737421751022339, 'rewards/chosen': -0.1340520828962326, 'rewards/rejected': -0.15207132697105408, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.01801922731101513, 'logps/rejected': -101.04169464111328, 'logps/chosen': -100.76226806640625, 'ref_logps/rejected': -99.52098083496094, 'ref_logps/chosen': -99.42174530029297, 'epoch': 0.09} 9%|▊ | 114/1336 [11:11<1:58:40, 5.83s/it] 9%|▊ | 115/1336 [11:16<1:58:44, 5.83s/it] {'loss': 0.661, 'grad_norm': 55.87954932345798, 'learning_rate': 4.959823971496574e-07, 'losses/dpo': 0.640383780002594, 'losses/sft': 0.52694171667099, 'losses/total': 0.640383780002594, 'rewards/chosen': -0.09292107820510864, 'rewards/rejected': -0.16199873387813568, 'rewards/accuracies': 0.75, 'rewards/margins': 0.06907765567302704, 'logps/rejected': -86.14532470703125, 'logps/chosen': -71.24462127685547, 'ref_logps/rejected': -84.52534484863281, 'ref_logps/chosen': -70.31541442871094, 'epoch': 0.09} 9%|▊ | 115/1336 [11:16<1:58:44, 5.83s/it] 9%|▊ | 116/1336 [11:22<2:00:18, 5.92s/it] {'loss': 0.6786, 'grad_norm': 100.1165967979672, 'learning_rate': 4.958733813395962e-07, 'losses/dpo': 0.6500837206840515, 'losses/sft': 0.6048222780227661, 'losses/total': 0.6500837206840515, 'rewards/chosen': -0.10733576864004135, 'rewards/rejected': -0.1431446224451065, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.03580884635448456, 'logps/rejected': -97.9770278930664, 'logps/chosen': -87.6973876953125, 'ref_logps/rejected': -96.54557800292969, 'ref_logps/chosen': -86.62403106689453, 'epoch': 0.09} 9%|▊ | 116/1336 [11:22<2:00:18, 5.92s/it] 9%|▉ | 117/1336 [11:28<2:00:29, 5.93s/it] {'loss': 0.6667, 'grad_norm': 72.15921930322168, 'learning_rate': 4.95762918519612e-07, 'losses/dpo': 0.6510855555534363, 'losses/sft': 
0.6643170118331909, 'losses/total': 0.6510855555534363, 'rewards/chosen': -0.09567861258983612, 'rewards/rejected': -0.1550256311893463, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.05934703350067139, 'logps/rejected': -94.06840515136719, 'logps/chosen': -84.52169799804688, 'ref_logps/rejected': -92.51815032958984, 'ref_logps/chosen': -83.56491088867188, 'epoch': 0.09} 9%|▉ | 117/1336 [11:28<2:00:29, 5.93s/it] 9%|▉ | 118/1336 [11:35<2:02:12, 6.02s/it] {'loss': 0.6827, 'grad_norm': 56.611923661586374, 'learning_rate': 4.956510093397983e-07, 'losses/dpo': 0.6635411381721497, 'losses/sft': 1.070958137512207, 'losses/total': 0.6635411381721497, 'rewards/chosen': -0.12911465764045715, 'rewards/rejected': -0.15668244659900665, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.02756778709590435, 'logps/rejected': -100.63397216796875, 'logps/chosen': -91.15338134765625, 'ref_logps/rejected': -99.06715393066406, 'ref_logps/chosen': -89.86223602294922, 'epoch': 0.09} 9%|▉ | 118/1336 [11:35<2:02:12, 6.02s/it] 9%|▉ | 119/1336 [11:40<1:59:27, 5.89s/it] {'loss': 0.6583, 'grad_norm': 55.09337817935897, 'learning_rate': 4.955376544587615e-07, 'losses/dpo': 0.6148874759674072, 'losses/sft': 1.4517148733139038, 'losses/total': 0.6148874759674072, 'rewards/chosen': -0.15510162711143494, 'rewards/rejected': -0.2362377792596817, 'rewards/accuracies': 0.75, 'rewards/margins': 0.08113616704940796, 'logps/rejected': -116.99227142333984, 'logps/chosen': -104.65599822998047, 'ref_logps/rejected': -114.62989807128906, 'ref_logps/chosen': -103.10497283935547, 'epoch': 0.09} 9%|▉ | 119/1336 [11:40<1:59:27, 5.89s/it] 9%|▉ | 120/1336 [11:46<1:58:52, 5.87s/it] {'loss': 0.6708, 'grad_norm': 47.61258025940019, 'learning_rate': 4.954228545436156e-07, 'losses/dpo': 0.6315573453903198, 'losses/sft': 0.8897170424461365, 'losses/total': 0.6315573453903198, 'rewards/chosen': -0.07921842485666275, 'rewards/rejected': -0.13102658092975616, 'rewards/accuracies': 0.5625, 'rewards/margins': 
0.051808152347803116, 'logps/rejected': -74.24674987792969, 'logps/chosen': -69.296630859375, 'ref_logps/rejected': -72.93648529052734, 'ref_logps/chosen': -68.50445556640625, 'epoch': 0.09} 9%|▉ | 120/1336 [11:46<1:58:52, 5.87s/it] 9%|▉ | 121/1336 [11:52<1:57:58, 5.83s/it] {'loss': 0.702, 'grad_norm': 61.24328103094695, 'learning_rate': 4.953066102699795e-07, 'losses/dpo': 0.7374367713928223, 'losses/sft': 1.2203149795532227, 'losses/total': 0.7374367713928223, 'rewards/chosen': -0.1498635709285736, 'rewards/rejected': -0.1371089518070221, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.012754621915519238, 'logps/rejected': -89.00853729248047, 'logps/chosen': -89.46236419677734, 'ref_logps/rejected': -87.63744354248047, 'ref_logps/chosen': -87.9637222290039, 'epoch': 0.09} 9%|▉ | 121/1336 [11:52<1:57:58, 5.83s/it] 9%|▉ | 122/1336 [11:58<1:57:47, 5.82s/it] {'loss': 0.6633, 'grad_norm': 62.78467413383069, 'learning_rate': 4.951889223219717e-07, 'losses/dpo': 0.6107309460639954, 'losses/sft': 0.5436482429504395, 'losses/total': 0.6107309460639954, 'rewards/chosen': -0.14370261132717133, 'rewards/rejected': -0.21213969588279724, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.06843708455562592, 'logps/rejected': -114.30026245117188, 'logps/chosen': -105.12085723876953, 'ref_logps/rejected': -112.17886352539062, 'ref_logps/chosen': -103.68383026123047, 'epoch': 0.09} 9%|▉ | 122/1336 [11:58<1:57:47, 5.82s/it] 9%|▉ | 123/1336 [12:03<1:57:34, 5.82s/it] {'loss': 0.6734, 'grad_norm': 68.62057227909783, 'learning_rate': 4.950697913922075e-07, 'losses/dpo': 0.585090160369873, 'losses/sft': 1.1336805820465088, 'losses/total': 0.585090160369873, 'rewards/chosen': -0.13738934695720673, 'rewards/rejected': -0.18614710867404938, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.048757750540971756, 'logps/rejected': -99.91390991210938, 'logps/chosen': -97.43536376953125, 'ref_logps/rejected': -98.05244445800781, 'ref_logps/chosen': -96.06147766113281, 'epoch': 0.09} 9%|▉ | 
123/1336 [12:03<1:57:34, 5.82s/it] 9%|▉ | 124/1336 [12:09<1:55:37, 5.72s/it] {'loss': 0.6664, 'grad_norm': 59.35928281258265, 'learning_rate': 4.949492181817943e-07, 'losses/dpo': 0.7265645265579224, 'losses/sft': 0.7230411767959595, 'losses/total': 0.7265645265579224, 'rewards/chosen': -0.17988818883895874, 'rewards/rejected': -0.24427875876426697, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.06439057737588882, 'logps/rejected': -124.33409118652344, 'logps/chosen': -106.76986694335938, 'ref_logps/rejected': -121.89129638671875, 'ref_logps/chosen': -104.97098541259766, 'epoch': 0.09} 9%|▉ | 124/1336 [12:09<1:55:37, 5.72s/it] 9%|▉ | 125/1336 [12:15<1:54:57, 5.70s/it] {'loss': 0.6523, 'grad_norm': 49.788287003656635, 'learning_rate': 4.948272034003275e-07, 'losses/dpo': 0.6400548219680786, 'losses/sft': 0.3020136058330536, 'losses/total': 0.6400548219680786, 'rewards/chosen': -0.10167922079563141, 'rewards/rejected': -0.1906861960887909, 'rewards/accuracies': 0.75, 'rewards/margins': 0.08900698274374008, 'logps/rejected': -81.24292755126953, 'logps/chosen': -69.7929916381836, 'ref_logps/rejected': -79.3360595703125, 'ref_logps/chosen': -68.77619934082031, 'epoch': 0.09} 9%|▉ | 125/1336 [12:15<1:54:57, 5.70s/it] 9%|▉ | 126/1336 [12:21<1:56:59, 5.80s/it] {'loss': 0.6652, 'grad_norm': 78.46080334773565, 'learning_rate': 4.947037477658864e-07, 'losses/dpo': 0.7279879450798035, 'losses/sft': 1.1443910598754883, 'losses/total': 0.7279879450798035, 'rewards/chosen': -0.17663787305355072, 'rewards/rejected': -0.23908114433288574, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.06244327872991562, 'logps/rejected': -110.05540466308594, 'logps/chosen': -110.53145599365234, 'ref_logps/rejected': -107.66458129882812, 'ref_logps/chosen': -108.76507568359375, 'epoch': 0.09} 9%|▉ | 126/1336 [12:21<1:56:59, 5.80s/it] 10%|▉ | 127/1336 [12:27<1:57:44, 5.84s/it] {'loss': 0.6813, 'grad_norm': 65.21038304074122, 'learning_rate': 4.945788520050301e-07, 'losses/dpo': 
0.7071346640586853, 'losses/sft': 0.5275065302848816, 'losses/total': 0.7071346640586853, 'rewards/chosen': -0.15979382395744324, 'rewards/rejected': -0.1883414089679718, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.028547586873173714, 'logps/rejected': -90.99200439453125, 'logps/chosen': -87.62205505371094, 'ref_logps/rejected': -89.10859680175781, 'ref_logps/chosen': -86.02410888671875, 'epoch': 0.1} 10%|▉ | 127/1336 [12:27<1:57:44, 5.84s/it] 10%|▉ | 128/1336 [12:32<1:57:17, 5.83s/it] {'loss': 0.6732, 'grad_norm': 60.346847541600944, 'learning_rate': 4.944525168527931e-07, 'losses/dpo': 0.6760457158088684, 'losses/sft': 0.34599775075912476, 'losses/total': 0.6760457158088684, 'rewards/chosen': -0.07553420960903168, 'rewards/rejected': -0.1287817507982254, 'rewards/accuracies': 0.5, 'rewards/margins': 0.05324753373861313, 'logps/rejected': -79.74027252197266, 'logps/chosen': -71.22399139404297, 'ref_logps/rejected': -78.45245361328125, 'ref_logps/chosen': -70.46864318847656, 'epoch': 0.1} 10%|▉ | 128/1336 [12:32<1:57:17, 5.83s/it] 10%|▉ | 129/1336 [12:38<1:56:14, 5.78s/it] {'loss': 0.6817, 'grad_norm': 45.62834988145372, 'learning_rate': 4.943247430526809e-07, 'losses/dpo': 0.7453728318214417, 'losses/sft': 0.46439453959465027, 'losses/total': 0.7453728318214417, 'rewards/chosen': -0.12877212464809418, 'rewards/rejected': -0.16609865427017212, 'rewards/accuracies': 0.40625, 'rewards/margins': 0.03732652962207794, 'logps/rejected': -84.09970092773438, 'logps/chosen': -83.1222915649414, 'ref_logps/rejected': -82.438720703125, 'ref_logps/chosen': -81.83456420898438, 'epoch': 0.1} 10%|▉ | 129/1336 [12:38<1:56:14, 5.78s/it] 10%|▉ | 130/1336 [12:44<1:56:40, 5.81s/it] {'loss': 0.6753, 'grad_norm': 74.64337110987802, 'learning_rate': 4.941955313566656e-07, 'losses/dpo': 0.5975279808044434, 'losses/sft': 0.8834252953529358, 'losses/total': 0.5975279808044434, 'rewards/chosen': -0.17224165797233582, 'rewards/rejected': -0.21711814403533936, 'rewards/accuracies': 
0.5625, 'rewards/margins': 0.044876500964164734, 'logps/rejected': -93.73190307617188, 'logps/chosen': -86.57962036132812, 'ref_logps/rejected': -91.56071472167969, 'ref_logps/chosen': -84.85720825195312, 'epoch': 0.1} 10%|▉ | 130/1336 [12:44<1:56:40, 5.81s/it] 10%|▉ | 131/1336 [12:50<1:56:20, 5.79s/it] {'loss': 0.7033, 'grad_norm': 86.90917688383999, 'learning_rate': 4.94064882525182e-07, 'losses/dpo': 0.7257283926010132, 'losses/sft': 1.0290143489837646, 'losses/total': 0.7257283926010132, 'rewards/chosen': -0.2130013108253479, 'rewards/rejected': -0.19828438758850098, 'rewards/accuracies': 0.53125, 'rewards/margins': -0.01471691019833088, 'logps/rejected': -92.44575500488281, 'logps/chosen': -93.2861328125, 'ref_logps/rejected': -90.46290588378906, 'ref_logps/chosen': -91.15612030029297, 'epoch': 0.1} 10%|▉ | 131/1336 [12:50<1:56:20, 5.79s/it] 10%|▉ | 132/1336 [12:55<1:54:43, 5.72s/it] {'loss': 0.6961, 'grad_norm': 69.2287202496006, 'learning_rate': 4.939327973271221e-07, 'losses/dpo': 0.7045786380767822, 'losses/sft': 0.7416899800300598, 'losses/total': 0.7045786380767822, 'rewards/chosen': -0.19965049624443054, 'rewards/rejected': -0.2017679512500763, 'rewards/accuracies': 0.46875, 'rewards/margins': 0.0021174438297748566, 'logps/rejected': -94.13520812988281, 'logps/chosen': -90.3099365234375, 'ref_logps/rejected': -92.11752319335938, 'ref_logps/chosen': -88.31343078613281, 'epoch': 0.1} 10%|▉ | 132/1336 [12:55<1:54:43, 5.72s/it] 10%|▉ | 133/1336 [13:01<1:56:31, 5.81s/it] {'loss': 0.703, 'grad_norm': 51.27685875836551, 'learning_rate': 4.937992765398316e-07, 'losses/dpo': 0.7032231688499451, 'losses/sft': 0.965686023235321, 'losses/total': 0.7032231688499451, 'rewards/chosen': -0.19010215997695923, 'rewards/rejected': -0.17823673784732819, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.011865412816405296, 'logps/rejected': -83.22978210449219, 'logps/chosen': -85.92092895507812, 'ref_logps/rejected': -81.44741821289062, 'ref_logps/chosen': 
-84.01990509033203, 'epoch': 0.1} 10%|▉ | 133/1336 [13:01<1:56:31, 5.81s/it] 10%|█ | 134/1336 [13:07<1:57:52, 5.88s/it] {'loss': 0.6868, 'grad_norm': 70.22436244318027, 'learning_rate': 4.936643209491051e-07, 'losses/dpo': 0.7534513473510742, 'losses/sft': 1.1029878854751587, 'losses/total': 0.7534513473510742, 'rewards/chosen': -0.20861941576004028, 'rewards/rejected': -0.22816208004951477, 'rewards/accuracies': 0.625, 'rewards/margins': 0.019542664289474487, 'logps/rejected': -101.95588684082031, 'logps/chosen': -104.39210510253906, 'ref_logps/rejected': -99.67426300048828, 'ref_logps/chosen': -102.305908203125, 'epoch': 0.1} 10%|█ | 134/1336 [13:07<1:57:52, 5.88s/it] 0%| | 0/58 [00:00