========================
START TIME: Wed Jul 3 04:07:43 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0703 04:07:46.423000 140515379451712 torch/distributed/run.py:757]
W0703 04:07:46.423000 140515379451712 torch/distributed/run.py:757] *****************************************
W0703 04:07:46.423000 140515379451712 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 04:07:46.423000 140515379451712 torch/distributed/run.py:757] *****************************************
[The same OMP_NUM_THREADS warning was emitted by each of the 8 torchrun launcher processes (timestamps 04:07:46.423 through 04:07:46.562); the verbatim repeats are omitted here.]
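Editorial note: torchrun warns because OMP_NUM_THREADS was left unset and therefore defaulted to 1 per rank. If host-side work (data loading, CPU ops) matters, one hedged way to follow the warning's suggestion is to export a per-rank thread budget before torch spins up its thread pools. The 8 ranks per node below is taken from this log (default0-default7 on each host); everything else is an assumption, not something this run did.

```python
# Hedged sketch: choose an intra-op thread budget per rank instead of
# torchrun's default of OMP_NUM_THREADS=1. Must run before torch creates
# its thread pools (i.e. before the first heavy torch import/use).
import os

RANKS_PER_NODE = 8                      # default0..default7 appear on each host in this log
cpus = os.cpu_count() or RANKS_PER_NODE
os.environ.setdefault("OMP_NUM_THREADS", str(max(1, cpus // RANKS_PER_NODE)))

import torch
torch.set_num_threads(int(os.environ["OMP_NUM_THREADS"]))
```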
[default0]:07/03/2024 04:08:07 [WARNING|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Vocab Size Padding] Padded vocab (size: 50257) with 15 dummy tokens (new size: 50272)
[default0]:07/03/2024 04:08:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Config:
[The config below was logged field by field by rank DP=0|PP=0|TP=0 with the same prefix; the repeated prefixes are dropped here.]
Config(general=GeneralArgs(project='bench_cluster',
                           run='%date_%jobid',
                           seed=42,
                           step=None,
                           consumed_train_samples=None,
                           benchmark_csv_path=None,
                           ignore_sanity_checks=True),
       parallelism=ParallelismArgs(dp=1,
                                   pp=2,
                                   tp=32,
                                   pp_engine=,
                                   tp_mode=,
                                   tp_linear_async_communication=False,
                                   expert_parallel_size=1),
       model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
                                                eos_token_id=2,
                                                hidden_act='silu',
                                                hidden_size=2048,
                                                initializer_range=0.02,
                                                intermediate_size=4096,
                                                is_llama_config=True,
                                                max_position_embeddings=4096,
                                                num_attention_heads=32,
                                                num_hidden_layers=24,
                                                num_key_value_heads=32,
                                                pad_token_id=None,
                                                pretraining_tp=1,
                                                rms_norm_eps=1e-05,
                                                rope_scaling=None,
                                                rope_theta=10000.0,
                                                tie_word_embeddings=True,
                                                use_cache=True,
                                                vocab_size=50272),
                       init_method=RandomInit(std=0.025),
                       dtype=torch.bfloat16,
                       make_vocab_size_divisible_by=1,
                       ddp_bucket_cap_mb=25),
       tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
                               tokenizer_revision=None,
                               tokenizer_max_length=None),
       checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'),
                                   checkpoint_interval=100000,
                                   save_initial_state=False,
                                   resume_checkpoint_path=None,
                                   checkpoints_path_is_shared_file_system=False),
       logging=LoggingArgs(log_level='info',
                           log_level_replica='info',
                           iteration_step_info_interval=1),
       tokens=TokensArgs(sequence_length=4096,
                         train_steps=20,
                         micro_batch_size=32,
                         batch_accumulation_per_replica=32,
                         val_check_interval=-1,
                         limit_val_batches=0,
                         limit_test_batches=0),
       optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
                                                                    adam_beta1=0.9,
                                                                    adam_beta2=0.95,
                                                                    torch_adam_is_fused=True,
                                                                    name='adamW'),
                               zero_stage=1,
                               weight_decay=0.01,
                               clip_grad=1.0,
                               accumulate_grad_in_fp32=True,
                               learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
                                                                       lr_warmup_steps=1,
                                                                       lr_warmup_style='linear',
                                                                       lr_decay_style='linear',
                                                                       lr_decay_steps=19,
                                                                       lr_decay_starting_step=None,
                                                                       min_decay_lr=1e-05)),
       data_stages=[DatasetStageArgs(name='Training Stage',
                                     start_training_step=1,
                                     data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
                                                                                hf_dataset_splits='train',
                                                                                hf_dataset_config_name=None,
                                                                                dataset_processing_num_proc_per_process=64,
                                                                                dataset_overwrite_cache=False,
                                                                                text_column_name='text'),
                                                   seed=42,
                                                   num_loading_workers=0))],
       profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-32_pp-2_mbz-32')),
       lighteval=None)
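Editorial note: the LRSchedulerArgs above describe a 1-step linear warmup to 1e-4 followed by a 19-step linear decay to 1e-5, which together cover the 20 train_steps. A minimal sketch of that schedule follows; the exact step indexing inside nanotron may differ, so treat this as an illustration rather than its implementation.

```python
def lr_at_step(step: int,
               peak_lr: float = 1e-4,          # learning_rate
               warmup_steps: int = 1,          # lr_warmup_steps
               decay_steps: int = 19,          # lr_decay_steps
               min_lr: float = 1e-5) -> float: # min_decay_lr
    """Linear warmup then linear decay, per the logged LRSchedulerArgs."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    frac = min((step - warmup_steps + 1) / decay_steps, 1.0)
    return peak_lr - (peak_lr - min_lr) * frac

# With 0-indexed steps: step 0 reaches 1e-4, step 19 (the 20th step) reaches 1e-5.
```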
[default0]:07/03/2024 04:08:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Model Config:
[Logged with the same prefix, the model config repeats the LlamaConfig above:]
LlamaConfig(bos_token_id=1,
            eos_token_id=2,
            hidden_act='silu',
            hidden_size=2048,
            initializer_range=0.02,
            intermediate_size=4096,
            is_llama_config=True,
            max_position_embeddings=4096,
            num_attention_heads=32,
            num_hidden_layers=24,
            num_key_value_heads=32,
            pad_token_id=None,
            pretraining_tp=1,
            rms_norm_eps=1e-05,
            rope_scaling=None,
            rope_theta=10000.0,
            tie_word_embeddings=True,
            use_cache=True,
            vocab_size=50272)
[default0]:07/03/2024 04:08:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Building model..
[default0]:07/03/2024 04:08:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Setting PP block ranks...
[default0]:07/03/2024 04:08:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Total number of parameters: 1.22G (2318.88MiB)
[default0]:07/03/2024 04:08:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Parametrizing model parameters using StandardParametrizator
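Editorial note: the "[Vocab Size Padding]" line at the top and vocab_size=50272 here are consistent with padding the GPT-2 vocabulary up to a multiple of the tensor-parallel degree, so every TP shard of the embedding/LM head gets the same number of rows. A quick check; the exact rule (tp × make_vocab_size_divisible_by) is an assumption about nanotron's padding logic.

```python
def padded_vocab(vocab_size: int, tp: int, make_divisible_by: int = 1) -> int:
    # Round up to the next multiple of tp * make_divisible_by.
    m = tp * make_divisible_by
    return ((vocab_size + m - 1) // m) * m

assert padded_vocab(50257, tp=32) == 50272   # 15 dummy tokens, as logged
assert 50272 // 32 == 1571                   # embedding rows per TP rank
```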
[04:08:25] Per-rank model build summary (64 ranks, DP=0, logged individually in the original output; every rank also reported "No checkpoint path provided."):

  pipeline stage  TP ranks  node              local parameters    memory after build  peak allocated  peak reserved
  PP=0            0-7       ip-26-0-162-233   21.6M (41.25MiB)    55.26MiB            57.29MiB        72.00MiB
  PP=0            8-15      ip-26-0-163-147   21.6M (41.25MiB)    55.26MiB            57.29MiB        72.00MiB
  PP=0            16-23     ip-26-0-164-207   21.6M (41.25MiB)    55.26MiB            57.29MiB        72.00MiB
  PP=0            24-31     ip-26-0-165-24    21.6M (41.25MiB)    55.26MiB            57.29MiB        72.00MiB
  PP=1            0-7       ip-26-0-166-125   16.4M (31.22MiB)    41.23MiB            43.26MiB        58.00MiB
  PP=1            8-15      ip-26-0-173-202   16.4M (31.22MiB)    41.23MiB            43.26MiB        58.00MiB
  PP=1            16-23     ip-26-0-173-246   16.4M (31.22MiB)    41.23MiB            43.26MiB        58.00MiB
  PP=1            24-31     ip-26-0-174-36    16.4M (31.22MiB)    41.23MiB            43.26MiB        58.00MiB
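Editorial note: the per-stage numbers in the table are consistent with the "Total number of parameters: 1.22G (2318.88MiB)" line, and the roughly 2 bytes per parameter matches the configured dtype=torch.bfloat16. A sanity check on the rounded values as logged:

```python
pp0 = 32 * 21.6e6            # 32 TP shards of pipeline stage 0
pp1 = 32 * 16.4e6            # 32 TP shards of pipeline stage 1
total = pp0 + pp1            # ~1.216e9 parameters, i.e. the logged 1.22G

bytes_per_param = 2318.88 * 2**20 / total   # ~2.0 -> bf16 weights
```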
[default0]:07/03/2024 04:08:26 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/03/2024 04:08:26 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/03/2024 04:08:26 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] DP Rank 0 has 21.6M out of 21.6M (100.00%) params' optimizer states
[default0]:07/03/2024 04:08:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/03/2024 04:08:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Using `datasets` library
[default0]:07/03/2024 04:08:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
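Editorial note: the "[ZeRO sharding] DP Rank 0 has 21.6M out of 21.6M (100.00%)" line above is expected: ZeRO stage 1 shards optimizer states across data-parallel replicas, and this run has dp=1, so the single replica keeps everything. A rough, hedged estimate of that per-rank footprint, assuming fp32 Adam moments and an fp32 gradient-accumulation buffer (accumulate_grad_in_fp32=True); the buffers nanotron actually allocates may differ.

```python
params_per_rank = 21.6e6
dp = 1
shard_fraction = 1 / dp                                   # 100% of optimizer states stay on this rank
adam_moments_mib = params_per_rank * (4 + 4) / 2**20      # exp_avg + exp_avg_sq in fp32, ~165 MiB
fp32_grad_accum_mib = params_per_rank * 4 / 2**20         # ~82 MiB
```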
[default0]:07/03/2024 04:08:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] There are 1 training stages
[default0]:07/03/2024 04:08:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Stage Training Stage] start from step 1
[default0]:07/03/2024 04:08:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]:
[default0]:07/03/2024 04:08:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Start training] datetime: 2024-07-03 04:08:28.604204 | mbs: 32 | grad_accum: 32 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
[default0]:07/03/2024 04:08:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/03/2024 04:08:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Memory usage: 220.25MiB. Peak allocated 220.25MiB. Peak reserved: 240.00MiB
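Editorial note: the global_batch_size in the "[Start training]" line follows directly from the parallelism and tokens config, and fixes the token throughput per optimizer step:

```python
dp, micro_batch_size, grad_accum, seq_len = 1, 32, 32, 4096
global_batch_size = dp * micro_batch_size * grad_accum   # 1024, as logged
tokens_per_step = global_batch_size * seq_len            # 4,194,304 tokens per optimizer step
```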
[04:08:27-04:08:29] Each worker (DP=0, PP=0/1, TP=0-31) then logged the same dataset-card warning while loading the dataset; the per-rank repeats are collapsed here:
    Repo card metadata block was not found. Setting CardData to empty.
[default1]:Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub
[default1]:Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024).
[default1]:07/03/2024 04:08:43 [WARNING|DP=0|PP=1|TP=9|ip-26-0-173-202]: Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub
[default1]:07/03/2024 04:08:43 [WARNING|DP=0|PP=1|TP=9|ip-26-0-173-202]: Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024).
[default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
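Editorial note: the "Repo card metadata" warning and the "latest cached version" messages come from the `datasets` library resolving roneneldan/TinyStories from the local cache. Roughly the equivalent standalone calls are sketched below; this is not nanotron's dataloader code, and num_proc=64 simply mirrors dataset_processing_num_proc_per_process from the config.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("roneneldan/TinyStories", split="train")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

def tokenize(batch):
    return tokenizer(batch["text"])   # text_column_name='text' in the config

tokenized = dataset.map(tokenize, batched=True, num_proc=64,
                        remove_columns=dataset.column_names)
```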
[default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default3]:[rank35]: Traceback (most recent call last):
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default3]:[rank35]: trainer.train(dataloader)
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default3]:[rank35]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default3]:[rank35]: outputs = self.pipeline_engine.train_batch_iter(
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter
[default3]:[rank35]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator)
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward
[default3]:[rank35]: grad_accumulator.backward(sum(activations))
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward
[default3]:[rank35]: result = loss.backward()
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[default3]:[rank35]: torch.autograd.backward(
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[default3]:[rank35]: _engine_run_backward(
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[default3]:[rank35]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[default3]:[rank35]: return user_fn(self, *args)
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/ops/triton/layer_norm.py", line 821, in backward
[default3]:[rank35]: dx, dw, db, dresidual_in, dx1, dw1, db1 = _layer_norm_bwd(
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/ops/triton/layer_norm.py", line 643, in _layer_norm_bwd
[default3]:[rank35]: _layer_norm_bwd_kernel[grid](
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
[default3]:[rank35]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 143, in run
[default3]:[rank35]: timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 143, in <dictcomp>
[default3]:[rank35]: timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 122, in _bench
[default3]:[rank35]: return do_bench(kernel_call, warmup=self.warmup, rep=self.rep, quantiles=(0.5, 0.2, 0.8))
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/testing.py", line 102, in do_bench
[default3]:[rank35]: fn()
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 110, in kernel_call
[default3]:[rank35]: self.fn.run(
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
[default3]:[rank35]: return self.fn.run(*args, **kwargs)
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
[default3]:[rank35]: return self.fn.run(*args, **kwargs)
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
[default3]:[rank35]: return self.fn.run(*args, **kwargs)
[default3]:[rank35]: [Previous line repeated 2 more times]
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
[default3]:[rank35]: self.cache[device][key] = compile(
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/compiler/compiler.py", line 202, in compile
[default3]:[rank35]: return CompiledKernel(so_path, metadata_group.get(metadata_filename))
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/compiler/compiler.py", line 230, in __init__
[default3]:[rank35]: self.asm = {
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/compiler/compiler.py", line 231, in <dictcomp>
[default3]:[rank35]: file.suffix[1:]: file.read_bytes() if file.suffix[1:] == driver.binary_ext else file.read_text()
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/pathlib.py", line 1134, in read_text
[default3]:[rank35]: with self.open(mode='r', encoding=encoding, errors=errors) as f:
[default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/pathlib.py", line 1119, in open
[default3]:[rank35]: return self._accessor.open(self, mode, buffering, encoding, errors,
[default3]:[rank35]: FileNotFoundError: [Errno 2] No such file or directory: '/admin/home/ferdinand_mom/.triton/cache/3bf15d0349ccda200916bdbc5ec3c269/_layer_norm_bwd_kernel.cubin.tmp.pid_1315311_776646'
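The FileNotFoundError above is consistent with a race on the shared Triton kernel cache: many processes on the node autotune the same flash-attn _layer_norm_bwd_kernel at once, and one of them can remove or rename a temporary .cubin.tmp.pid_* file under /admin/home/ferdinand_mom/.triton/cache while another is still trying to read it. A common mitigation, sketched below as an assumption rather than something this run used, is to give every rank its own cache directory through Triton's TRITON_CACHE_DIR environment variable before the first kernel is compiled; the helper name and base path are illustrative only.

import os

def use_private_triton_cache(base_dir="/tmp/triton_cache"):
    # Hypothetical helper: give every torchrun worker its own Triton cache so
    # concurrent autotuning runs cannot clobber each other's .cubin.tmp files.
    local_rank = os.environ.get("LOCAL_RANK", "0")  # set by torchrun per process
    cache_dir = os.path.join(base_dir, f"rank_{local_rank}")
    os.makedirs(cache_dir, exist_ok=True)
    # Triton reads TRITON_CACHE_DIR when its on-disk cache is first touched,
    # so this must run before any Triton kernel launch (e.g. at the top of
    # run_train.py).
    os.environ["TRITON_CACHE_DIR"] = cache_dir

use_private_triton_cache()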
[default0]:[rank24]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600047 milliseconds before timing out.
The same ProcessGroupNCCL watchdog timeout (SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) was reported by all of ranks 0 through 31, each after the pending SEND had run for roughly 600000 milliseconds.
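The Timeout(ms)=600000 in these watchdog messages is the process-group timeout: once rank 35 had crashed, the matching receive for these SENDs could never be posted, and after ten minutes the NCCL watchdog aborted the communicators. The sketch below only illustrates where that timeout comes from; it is not taken from this repository, and raising it would merely delay the abort rather than fix the crash above.

from datetime import timedelta

import torch.distributed as dist

# Illustrative only: the 600000 ms watchdog limit corresponds to the timeout
# passed to (or defaulted by) init_process_group. A larger value gives a
# stalled peer more time before the communicator is aborted, e.g.:
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # the log above shows a 10 minute limit
)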
[default0]:[rank24]: Traceback (most recent call last):
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default0]:[rank24]: trainer.train(dataloader)
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default0]:[rank24]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default0]:[rank24]: outputs = self.pipeline_engine.train_batch_iter(
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter
[default0]:[rank24]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator)
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward
[default0]:[rank24]: grad_accumulator.backward(sum(activations))
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward
[default0]:[rank24]: result = loss.backward()
[default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[default0]:[rank24]: torch.autograd.backward(
[default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[default0]:[rank24]: _engine_run_backward(
[default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[default0]:[rank24]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[default0]:[rank24]: return user_fn(self, *args)
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward
[default0]:[rank24]: pipeline_state.run_communication()
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication
[default0]:[rank24]: send_activation()
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__
[default0]:[rank24]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank)
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors
[default0]:[rank24]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag)
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors
[default0]:[rank24]: self._send_meta(tensor, to_rank=to_rank, tag=tag)
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta
[default0]:[rank24]: dist.send(
[default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default0]:[rank24]: return func(*args, **kwargs)
[default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send
[default0]:[rank24]: group.send([tensor], group_dst_rank, tag).wait()
[default0]:[rank24]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
Ranks 1, 2, 3, 11, 12, 14, 15, 19, 22, 27, and 30 printed identical tracebacks, following the same path through nanotron's pipeline_parallel engine.py, state.py, and p2p.py into dist.send and ending with the same torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
[default4]:[rank20]: Traceback (most recent call last): [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank20]: trainer.train(dataloader) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank20]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank20]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank20]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank20]: grad_accumulator.backward(sum(activations)) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank20]: result = loss.backward() [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank20]: torch.autograd.backward( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank20]: _engine_run_backward( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank20]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank20]: return user_fn(self, *args) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank20]: pipeline_state.run_communication() [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default4]:[rank20]: send_activation() [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]:[rank20]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default4]:[rank20]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default4]:[rank20]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default4]:[rank20]: dist.send( [default4]:[rank20]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank20]: return func(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default4]:[rank20]: group.send([tensor], group_dst_rank, tag).wait() [default4]:[rank20]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank28]: Traceback (most recent call last): [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank28]: trainer.train(dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank28]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank28]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank28]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank28]: grad_accumulator.backward(sum(activations)) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank28]: result = loss.backward() [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank28]: torch.autograd.backward( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank28]: _engine_run_backward( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank28]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank28]: return user_fn(self, *args) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank28]: pipeline_state.run_communication() [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default4]:[rank28]: send_activation() [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]:[rank28]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors 
[default4]:[rank28]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default4]:[rank28]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default4]:[rank28]: dist.send( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank28]: return func(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default4]:[rank28]: group.send([tensor], group_dst_rank, tag).wait() [default4]:[rank28]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank29]: Traceback (most recent call last): [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank29]: trainer.train(dataloader) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank29]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank29]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank29]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank29]: grad_accumulator.backward(sum(activations)) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank29]: result = loss.backward() [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank29]: torch.autograd.backward( [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank29]: _engine_run_backward( [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank29]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank29]: return user_fn(self, *args) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank29]: pipeline_state.run_communication() [default5]:[rank29]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default5]:[rank29]: send_activation() [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank29]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank29]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default5]:[rank29]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default5]:[rank29]: dist.send( [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank29]: return func(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank29]: group.send([tensor], group_dst_rank, tag).wait() [default5]:[rank29]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank16]: Traceback (most recent call last): [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank16]: trainer.train(dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank16]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank16]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank16]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank16]: grad_accumulator.backward(sum(activations)) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank16]: result = loss.backward() [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank16]: torch.autograd.backward( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank16]: _engine_run_backward( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank16]: return 
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank16]: return user_fn(self, *args) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank16]: pipeline_state.run_communication() [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank16]: send_activation() [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default0]:[rank16]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default0]:[rank16]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default0]:[rank16]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank16]: dist.send( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank16]: return func(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default0]:[rank16]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank16]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. 
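Every crashed rank above fails at the same frame: nanotron's P2P._send_meta issues a blocking dist.send of a small metadata tensor before the activation itself, and group.send([tensor], group_dst_rank, tag).wait() only returns once the destination stage posts a matching receive. If the peer never gets there, the send sits until the NCCL watchdog aborts the communicator, which then surfaces in Python as "NCCL communicator was aborted on rank 0". The following is a minimal, self-contained sketch of that blocking point-to-point pattern, not nanotron's implementation; the gloo backend, the TCP rendezvous address, and the 6-element tensor are assumptions chosen so it runs on CPU:

import datetime
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Illustrative settings: gloo + TCP rendezvous so this runs on CPU without a launcher.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29511",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=30),
    )
    meta = torch.zeros(6, dtype=torch.int64)  # tiny header tensor, like the NumelIn=6 sends in the log
    if rank == 0:
        meta += 1
        dist.send(meta, dst=1)   # blocking: returns only after rank 1 posts the matching recv
    else:
        dist.recv(meta, src=0)   # drop this recv and rank 0's send blocks until the timeout
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)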
[default0]:[rank24]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10.
[default0]:[rank24]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default0]:[rank24]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[default0]:[rank24]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600047 milliseconds before timing out.
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f61dfad8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f61e0db1c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f61e0db6a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61e0db7dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #4: + 0xd3e95 (0x7f622c850e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default0]:frame #5: + 0x8609 (0x7f6231897609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #6: clone + 0x43 (0x7f6231662353 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
[default0]:terminate called after throwing an instance of 'c10::DistBackendError'
[default0]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600047 milliseconds before timing out.
[default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f61dfad8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: + 0xe32119 (0x7f61e0a3b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: + 0xd3e95 (0x7f622c850e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default0]:frame #3: + 0x8609 (0x7f6231897609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #4: clone + 0x43 (0x7f6231662353 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
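Each watchdog block reports the same budget, Timeout(ms)=600000: the queued SEND (SeqNum=11) waited ten minutes for its peer before the watchdog tore the communicator down, which is consistent with the DistBackendError raised on the Python side. That budget is a property of the process group and can be set when the group is created. The sketch below shows the knob in generic torch.distributed terms; the 30-minute value and the env:// rendezvous are illustrative assumptions rather than this run's configuration, and raising the timeout only buys diagnosis time for a genuine stall like this one:

import datetime
import torch.distributed as dist

# Launch under torchrun (or another launcher) so RANK, WORLD_SIZE, MASTER_ADDR
# and MASTER_PORT are present; env:// reads the rendezvous from those variables.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=30),  # the watchdog lines above show a 600000 ms (10 min) budget
)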
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default7]:[rank7]: dist.send( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank7]: return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default7]:[rank7]: group.send([tensor], group_dst_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank25]: Traceback (most recent call last): [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank25]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank25]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank25]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank25]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank25]: grad_accumulator.backward(sum(activations)) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank25]: result = loss.backward() [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank25]: torch.autograd.backward( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank25]: _engine_run_backward( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank25]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank25]: return user_fn(self, *args) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank25]: pipeline_state.run_communication() [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank25]: send_activation() [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank25]: self.p2p.send_tensors([self.activation], 
to_rank=self.to_rank) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank25]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank25]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank25]: dist.send( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank25]: return func(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank25]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank25]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank4]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank4]: grad_accumulator.backward(sum(activations)) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank4]: result = loss.backward() [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank4]: torch.autograd.backward( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank4]: _engine_run_backward( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank4]: return user_fn(self, *args) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", 
line 40, in backward [default4]:[rank4]: pipeline_state.run_communication() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default4]:[rank4]: send_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]:[rank4]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default4]:[rank4]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default4]:[rank4]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default4]:[rank4]: dist.send( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank4]: return func(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default4]:[rank4]: group.send([tensor], group_dst_rank, tag).wait() [default4]:[rank4]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:563] [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:563] [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600021 milliseconds before timing out. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:563] [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600017 milliseconds before timing out. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:563] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600092 milliseconds before timing out. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:563] [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:563] [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600024 milliseconds before timing out. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:563] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600074 milliseconds before timing out. 
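A second family of timeouts, on ranks in the 32-62 range, is not the metadata SEND but an _ALLGATHER_BASE (SeqNum=69) with NumelIn=8388608 and NumelOut=268435456. The output is exactly 32 times the input, so the collective spans a 32-rank group, and it plausibly stalls because peer stages above have already been taken down. A quick check of that ratio:

# Sanity check of the all-gather sizes reported by the watchdog lines above.
numel_in = 8_388_608      # NumelIn per rank
numel_out = 268_435_456   # NumelOut after the all-gather
group_size = numel_out // numel_in
assert numel_in * group_size == numel_out
print(group_size)          # -> 32, so the collective spans a 32-rank group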
[default3]:[rank19]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10.
[default3]:[rank19]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default3]:[rank19]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[default3]:[rank19]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out.
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc8ccc19897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc8cdef2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc8cdef7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc8cdef8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #4: + 0xd3e95 (0x7fc919991e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default3]:frame #5: + 0x8609 (0x7fc91e9d8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #6: clone + 0x43 (0x7fc91e7a3353 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default3]:terminate called after throwing an instance of 'c10::DistBackendError'
[default3]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out.
[default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc8ccc19897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: + 0xe32119 (0x7fc8cdb7c119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: + 0xd3e95 (0x7fc919991e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default3]:frame #3: + 0x8609 (0x7fc91e9d8609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #4: clone + 0x43 (0x7fc91e7a3353 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default6]:[rank22]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10.
[default6]:[rank22]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default6]:[rank22]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[default6]:[rank22]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fde797ca897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fde7aaa3c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fde7aaa8a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fde7aaa9dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #4: + 0xd3e95 (0x7fdec6542e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default6]:frame #5: + 0x8609 (0x7fdecb589609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #6: clone + 0x43 (0x7fdecb354353 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default6]:terminate called after throwing an instance of 'c10::DistBackendError'
[default6]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fde797ca897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: + 0xe32119 (0x7fde7a72d119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: + 0xd3e95 (0x7fdec6542e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default6]:frame #3: + 0x8609
(0x7fdecb589609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7fdecb354353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6179314897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f617a5edc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f617a5f2a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f617a5f3dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f61c608ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f61cb0d3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f61cae9e353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6179314897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f617a5edc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f617a5f2a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f617a5f3dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f61c608ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f61cb0d3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f61cae9e353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6179314897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f617a277119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f61c608ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7f61cb0d3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f61cae9e353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default2]:[rank26]: Traceback (most recent call last): [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: trainer.train(dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank26]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank26]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank26]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank26]: grad_accumulator.backward(sum(activations)) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank26]: result = loss.backward() [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward 
[default2]:[rank26]: torch.autograd.backward( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank26]: _engine_run_backward( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank26]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank26]: return user_fn(self, *args) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank26]: pipeline_state.run_communication() [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank26]: send_activation() [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank26]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank26]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank26]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank26]: dist.send( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank26]: return func(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank26]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank26]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0829b12897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f082adebc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f082adf0a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f082adf1dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f087688ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f087b8d1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f087b69c353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0829b12897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f082adebc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f082adf0a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f082adf1dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f087688ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f087b8d1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f087b69c353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0829b12897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f082aa75119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f087688ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 
(0x7f087b8d1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f087b69c353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600000 milliseconds before timing out. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:[rank31]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9db2641897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7251bbe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7252e97c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7252e9ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7252e9ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f729e936e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f72a397d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f72a3748353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9db391ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9db391fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. 
[default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9db3920dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f9dff3b9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f9e04400609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7251bbe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #6: clone + 0x43 (0x7f9e041cb353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7252e97c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7252e9ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7252e9ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f729e936e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600000 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #5: + 0x8609 (0x7f72a397d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f72a3748353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7251bbe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9db2641897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f7252b21119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9db391ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9db391fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9db3920dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f9dff3b9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f9e04400609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f9e041cb353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default7]:frame #2: + 0xd3e95 (0x7f729e936e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f72a397d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9db2641897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f9db35a4119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f9dff3b9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f9e04400609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f9e041cb353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default7]:frame #4: clone + 0x43 (0x7f72a3748353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default6]:[rank30]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[default6]:[rank30]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600046 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbaeb536897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fbaec80fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbaec814a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbaec815dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7fbb382aee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fbb3d2f5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fbb3d0c0353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600046 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbaeb536897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fbaec80fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbaec814a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbaec815dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7fbb382aee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fbb3d2f5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fbb3d0c0353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbaeb536897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7fbaec499119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7fbb382aee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7fbb3d2f5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7fbb3d0c0353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default3]:[rank27]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default3]:[rank27]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank27]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank27]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f35db93e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f35dcc17c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f35dcc1ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35dcc1ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f36286b6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f362d6fd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f362d4c8353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f35db93e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f35dcc17c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f35dcc1ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f35dcc1ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f36286b6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f362d6fd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f362d4c8353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f35db93e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f35dc8a1119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f36286b6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 
(0x7f362d6fd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f362d4c8353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default1]:[rank25]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f19b57d1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f19b6aaac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f19b6aafa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f19b6ab0dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f1a02549e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f1a07590609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f1a0735b353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f19b57d1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f19b6aaac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f19b6aafa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f19b6ab0dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f1a02549e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f1a07590609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f1a0735b353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f19b57d1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f19b6734119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f1a02549e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f1a07590609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f1a0735b353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default0]:[rank16]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f781d5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2f794aec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2f794b3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2f794b4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f2fc4f4de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f2fc9f94609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f2fc9d5f353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f781d5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2f794aec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2f794b3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2f794b4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f2fc4f4de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f2fc9f94609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f2fc9d5f353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f781d5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f2f79138119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f2fc4f4de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 
(0x7f2fc9f94609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f2fc9d5f353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default7]:[rank23]: Traceback (most recent call last): [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank23]: trainer.train(dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank23]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank23]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank23]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank23]: grad_accumulator.backward(sum(activations)) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank23]: result = loss.backward() [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank23]: torch.autograd.backward( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]:[rank23]: _engine_run_backward( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank23]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:[rank23]: return user_fn(self, *args) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank23]: pipeline_state.run_communication() [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default7]:[rank23]: send_activation() [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default7]:[rank23]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default7]:[rank23]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default7]:[rank23]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default7]:[rank23]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default7]:[rank23]: dist.send( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank23]: return func(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default7]:[rank23]: group.send([tensor], group_dst_rank, tag).wait() [default7]:[rank23]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank17]: Traceback (most recent call last): [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank17]: trainer.train(dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank17]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank17]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank17]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank17]: grad_accumulator.backward(sum(activations)) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank17]: result = loss.backward() [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank17]: torch.autograd.backward( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank17]: _engine_run_backward( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank17]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank17]: return user_fn(self, *args) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank17]: pipeline_state.run_communication() [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank17]: send_activation() [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank17]: self.p2p.send_tensors([self.activation], 
to_rank=self.to_rank) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank17]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank17]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank17]: dist.send( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank17]: return func(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank17]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank17]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank21]: Traceback (most recent call last): [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank21]: trainer.train(dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank21]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank21]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank21]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank21]: grad_accumulator.backward(sum(activations)) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank21]: result = loss.backward() [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank21]: torch.autograd.backward( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank21]: _engine_run_backward( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank21]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank21]: return user_fn(self, *args) [default5]:[rank21]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank21]: pipeline_state.run_communication() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default5]:[rank21]: send_activation() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank21]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank21]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default5]:[rank21]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default5]:[rank21]: dist.send( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank21]: return func(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank21]: group.send([tensor], group_dst_rank, tag).wait() [default5]:[rank21]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. 
[default2]:[rank18]: Traceback (most recent call last): [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank18]: trainer.train(dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank18]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank18]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank18]: grad_accumulator.backward(sum(activations)) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank18]: result = loss.backward() [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank18]: torch.autograd.backward( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank18]: _engine_run_backward( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank18]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank18]: return user_fn(self, *args) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank18]: pipeline_state.run_communication() [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank18]: send_activation() [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank18]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank18]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank18]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank18]: dist.send( [default2]:[rank18]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank18]: return func(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank18]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank18]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c37c0c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0c38ee5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0c38eeaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c38eebdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f0c84984e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f0c899cb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f0c89796353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c37c0c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0c38ee5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0c38eeaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c38eebdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f0c84984e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f0c899cb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f0c89796353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c37c0c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f0c38b6f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f0c84984e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f0c899cb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f0c89796353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default1]:[rank41]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600052 milliseconds before timing out. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:563] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600032 milliseconds before timing out. [default7]:[rank47]:[E ProcessGroupNCCL.cpp:563] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:563] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:563] [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. 
[default6]:[rank38]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[default1]:[rank33]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
[default7]:[rank55]:[E ProcessGroupNCCL.cpp:563] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[default1]:[rank49]:[E ProcessGroupNCCL.cpp:563] [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
[default4]:[rank52]:[E ProcessGroupNCCL.cpp:563] [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[default2]:[rank58]:[E ProcessGroupNCCL.cpp:563] [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
[default7]:[rank63]:[E ProcessGroupNCCL.cpp:563] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
[default3]:[rank59]:[E ProcessGroupNCCL.cpp:563] [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[default2]:[rank42]:[E ProcessGroupNCCL.cpp:563] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
[default5]:[rank61]:[E ProcessGroupNCCL.cpp:563] [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600075 milliseconds before timing out.
[default1]:[rank57]:[E ProcessGroupNCCL.cpp:563] [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600088 milliseconds before timing out.
[default4]:[rank36]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[default5]:[rank37]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600088 milliseconds before timing out.
[default4]:[rank60]:[E ProcessGroupNCCL.cpp:563] [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
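Every _ALLGATHER_BASE timeout in this block reports NumelIn=8388608 and NumelOut=268435456, a ratio of exactly 32: each participant contributes an 8,388,608-element shard and the all-gather concatenates 32 such shards, i.e. the stalled collective spans a 32-way group. A one-line sanity check of that arithmetic:

    # NumelOut / NumelIn for the timed-out _ALLGATHER_BASE operations above
    assert 268_435_456 == 32 * 8_388_608
    print(268_435_456 // 8_388_608)  # -> 32 shards gathered per call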
[default2]:[rank34]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. [default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank10]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank10]: grad_accumulator.backward(sum(activations)) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank10]: result = loss.backward() [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank10]: torch.autograd.backward( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank10]: _engine_run_backward( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank10]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank10]: return user_fn(self, *args) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank10]: pipeline_state.run_communication() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank10]: send_activation() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank10]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank10]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank10]: 
self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank10]: dist.send( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank10]: return func(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank10]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank10]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank13]: Traceback (most recent call last): [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank13]: trainer.train(dataloader) [default7]:[rank15]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. 
[default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb940a20897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb941cf9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank13]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb941cfea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb941cffdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fb98d798e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fb9927df609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fb9925aa353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]:[rank13]: grad_accumulator.backward(sum(activations)) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank13]: result = loss.backward() [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank13]: torch.autograd.backward( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb940a20897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank13]: _engine_run_backward( [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb941cf9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank13]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb941cfea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb941cffdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fb98d798e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:[rank8]: trainer.train(dataloader) [default5]:[rank13]: return user_fn(self, *args) [default7]:frame #5: + 0x8609 (0x7fb9927df609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fb9925aa353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank13]: pipeline_state.run_communication() [default7]: [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb940a20897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7fb941983119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default7]:frame #2: + 0xd3e95 (0x7fb98d798e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:[rank8]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank13]: send_activation() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank13]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default0]:[rank8]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:frame #3: + 0x8609 (0x7fb9927df609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank13]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:frame #4: clone + 0x43 (0x7fb9925aa353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:[rank11]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default7]: [default0]:[rank8]: grad_accumulator.backward(sum(activations)) [default5]:[rank13]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. 
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank8]: result = loss.backward() [default5]:[rank13]: dist.send( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4c8ea3c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4c8fd15c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4c8fd1aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4c8fd1bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: torch.autograd.backward( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank13]: return func(*args, **kwargs) [default3]:frame #4: + 0xd3e95 (0x7f4cdb7b4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank13]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank8]: _engine_run_backward( [default3]:frame #5: + 0x8609 (0x7f4ce07fb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:frame #6: clone + 0x43 (0x7f4ce05c6353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]:[rank13]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. 
[default0]:[rank8]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank8]: return user_fn(self, *args) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank8]: pipeline_state.run_communication() [default3]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4c8ea3c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank8]: send_activation() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4c8fd15c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4c8fd1aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4c8fd1bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f4cdb7b4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:[rank8]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:frame #5: + 0x8609 (0x7f4ce07fb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:frame #6: clone + 0x43 (0x7f4ce05c6353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default0]:[rank8]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4c8ea3c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank8]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:frame #1: + 0xe32119 (0x7f4c8f99f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default3]:frame #2: + 0xd3e95 (0x7f4cdb7b4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:[rank8]: dist.send( [default3]:frame #3: + 0x8609 (0x7f4ce07fb609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:frame #4: clone + 0x43 (0x7f4ce05c6353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]:[rank9]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default0]:[rank8]: return func(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank8]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]: [default1]:[rank9]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank12]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. 
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank9]: grad_accumulator.backward(sum(activations)) [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff2f7a1c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff2f8cf5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff2f8cfaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: result = loss.backward() [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank9]: torch.autograd.backward( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank9]: _engine_run_backward( [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff2f8cfbdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7ff344794e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7ff3497db609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank9]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:frame #6: clone + 0x43 (0x7ff3495a6353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank9]: return user_fn(self, *args) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank9]: pipeline_state.run_communication() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank9]: send_activation() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff2f7a1c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff2f8cf5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank9]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff2f8cfaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff2f8cfbdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7ff344794e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank9]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank9]: dist.send( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank9]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank9]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. 
[default4]:frame #5: + 0x8609 (0x7ff3497db609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7ff3495a6353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff2f7a1c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7ff2f897f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7ff344794e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7ff3497db609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7ff3495a6353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default5]:[rank45]:[E ProcessGroupNCCL.cpp:563] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. [default7]:[rank39]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. 
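Note that the SEND timeouts are all logged as "[PG 4 Rank 0]" even though they come from many different global ranks (rank3, rank11, rank12, rank14, rank15, ...). The rank in that prefix is the caller's rank inside the particular process group that ran the timed-out operation, not its global rank, which is why so many distinct processes can each report "Rank 0". A minimal sketch of how a sub-group rank differs from the global rank; the pairwise grouping below is illustrative and not the run's actual topology:

    # group_rank_sketch.py -- run with: torchrun --nproc_per_node=<even N> group_rank_sketch.py
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    world, rank = dist.get_world_size(), dist.get_rank()

    # every process must create every sub-group; pair up ranks (0,1), (2,3), ...
    pairs = [dist.new_group(ranks=[i, i + 1]) for i in range(0, world, 2)]
    my_pair = pairs[rank // 2]

    # inside each pair the lower global rank is group rank 0
    print(f"global rank {rank} -> group rank {dist.get_rank(group=my_pair)} in its pair")
    dist.destroy_process_group()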
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9f0f662897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9f1093bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9f10940a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9f10941dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f9f5c3dae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f9f61421609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f9f611ec353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9f0f662897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9f1093bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9f10940a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9f10941dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f9f5c3dae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f9f61421609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f9f611ec353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9f0f662897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f9f105c5119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f9f5c3dae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 
(0x7f9f61421609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f9f611ec353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default1]:[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f033dcbf897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f033ef98c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f033ef9da80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f033ef9edcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f038aa37e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f038fa7e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f038f849353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f033dcbf897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f033ef98c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f033ef9da80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f033ef9edcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f038aa37e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f038fa7e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f038f849353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f033dcbf897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f033ec22119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f038aa37e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f038fa7e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f038f849353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default2]:[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05b7d0f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f05b8fe8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f05b8feda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f05b8feedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f0604a87e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f0609ace609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f0609899353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05b7d0f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f05b8fe8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f05b8feda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f05b8feedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f0604a87e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f0609ace609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f0609899353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05b7d0f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f05b8c72119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f0604a87e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 
(0x7f0609ace609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f0609899353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default0]:[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600040 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff7c2391897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff7c366ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff7c366fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff7c3670dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7ff80f109e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7ff814150609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7ff813f1b353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600040 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff7c2391897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff7c366ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff7c366fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff7c3670dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7ff80f109e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7ff814150609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7ff813f1b353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff7c2391897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7ff7c32f4119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7ff80f109e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7ff814150609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7ff813f1b353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5399f0e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f539b1e7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f539b1eca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f539b1eddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f53e6c86e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f53ebccd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f53eba98353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5399f0e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f539b1e7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f539b1eca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f539b1eddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f53e6c86e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f53ebccd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f53eba98353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5399f0e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f539ae71119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f53e6c86e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 
(0x7f53ebccd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f53eba98353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600046 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5282bac897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5283e85c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5283e8aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5283e8bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f52cf924e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f52d496b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f52d4736353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600046 milliseconds before timing out. 
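The hanging operation on these ranks is a point-to-point send of only 6 elements, consistent with a peer-to-peer exchange whose matching receive was never posted on the other side: the send cannot complete, so it stays in the NCCL queue until the process-group timeout expires and the watchdog aborts the job, exactly as logged. A hedged, self-contained illustration of that send/recv pairing follows; the file name, shapes, and launch command are assumptions, not taken from this run.

# sketch_send_recv.py -- launch with: torchrun --nproc_per_node=2 sketch_send_recv.py
# Assumed 2-GPU illustration of the blocking SEND pattern in the traces above.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.full((6,), float(rank), device="cuda")  # 6 elements, like NumelIn=6 above

if rank == 0:
    # The send cannot complete until rank 1 posts the matching recv; if rank 1
    # never gets there, the NCCL watchdog fires after the process-group timeout,
    # producing exactly the kind of trace shown in this log.
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)

dist.destroy_process_group()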
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5282bac897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5283e85c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5283e8aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5283e8bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f52cf924e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f52d496b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f52d4736353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5282bac897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f5283b0f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f52cf924e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f52d496b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f52d4736353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f74b59897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f8f75e32c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f75e37a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f75e38dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f8fc18d1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5203794897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #5: + 0x8609 (0x7f8fc6918609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f8fc66e3353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5204a6dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5204a72a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f74b59897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f8f75e32c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f75e37a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f75e38dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f8fc18d1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5204a73dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f525050ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f5255553609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #5: + 0x8609 (0x7f8fc6918609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f525531e353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]:frame #6: clone + 0x43 (0x7f8fc66e3353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default6]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. 
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f74b59897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #1: + 0xe32119 (0x7f8f75abc119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f8fc18d1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f8fc6918609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5203794897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #4: clone + 0x43 (0x7f8fc66e3353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5204a6dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5204a72a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5204a73dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f525050ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f5255553609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f525531e353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5203794897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f52046f7119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f525050ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f5255553609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f525531e353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default2]:[rank26]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default2]:[rank26]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank26]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank26]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad66438897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fad67711c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fad67716a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fad67717dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fadb31b0e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fadb81f7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fadb7fc2353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad66438897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fad67711c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fad67716a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fad67717dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fadb31b0e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fadb81f7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fadb7fc2353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad66438897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7fad6739b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7fadb31b0e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 
(0x7fadb81f7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7fadb7fc2353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdae4baa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fdae5e83c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fdae5e88a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fdae5e89dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fdb31922e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fdb36969609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fdb36734353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdae4baa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fdae5e83c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fdae5e88a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fdae5e89dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fdb31922e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fdb36969609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fdb36734353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdae4baa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7fdae5b0d119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7fdb31922e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7fdb36969609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7fdb36734353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default2]:[rank18]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. 
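When a run dies this way, the usual next step is to relaunch with more distributed diagnostics enabled so the first rank to stall can be located. The switches in the sketch below are standard NCCL / torch.distributed environment variables, not settings read from this log, and they must be in the environment before the process group is created (for example, exported from the launch script).

# Assumed, generic diagnostics -- not taken from this job's configuration.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL communicator/transport logging
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"     # restrict NCCL logging to init + collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra shape/ordering checks on collectives
# ...then initialize the process group and start training as usual.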
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3b99c2d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3b9af06c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3b9af0ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b9af0cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f3be69a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f3beb9ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f3beb7b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3b99c2d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3b9af06c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3b9af0ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b9af0cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f3be69a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f3beb9ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f3beb7b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3b99c2d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f3b9ab90119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f3be69a5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 
(0x7f3beb9ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f3beb7b7353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01ea53897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd01fd2cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd01fd31a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd01fd32dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fd06b7cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fd070812609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fd0705dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01ea53897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd01fd2cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd01fd31a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd01fd32dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fd06b7cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fd070812609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fd0705dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01ea53897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7fd01f9b6119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7fd06b7cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7fd070812609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7fd0705dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default1]:[rank17]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600025 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f60a1b63897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f60a2e3cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f60a2e41a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f60a2e42dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f60ee8dbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f60f3922609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f60f36ed353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600025 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f60a1b63897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f60a2e3cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f60a2e41a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f60a2e42dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f60ee8dbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f60f3922609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f60f36ed353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f60a1b63897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f60a2ac6119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f60ee8dbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 
(0x7f60f3922609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f60f36ed353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3e01367897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3e02640c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3e02645a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3e02646dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f3e4e0dfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f3e53126609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f3e52ef1353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3e01367897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3e02640c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3e02645a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3e02646dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f3e4e0dfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f3e53126609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f3e52ef1353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3e01367897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f3e022ca119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f3e4e0dfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f3e53126609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f3e52ef1353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600038 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a0b686897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2a0c95fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2a0c964a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2a0c965dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f2a583fee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f2a5d445609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f2a5d210353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600038 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a0b686897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2a0c95fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2a0c964a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2a0c965dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f2a583fee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f2a5d445609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f2a5d210353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a0b686897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f2a0c5e9119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f2a583fee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 
(0x7f2a5d445609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f2a5d210353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcbca71d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fcbcb9f6c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fcbcb9fba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fcbcb9fcdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fcc17495e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fcc1c4dc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fcc1c2a7353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcbca71d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fcbcb9f6c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fcbcb9fba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fcbcb9fcdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fcc17495e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fcc1c4dc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fcc1c2a7353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcbca71d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7fcbcb680119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7fcc17495e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7fcc1c4dc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7fcc1c2a7353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default1]:[rank9]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 11, last enqueued NCCL work: 12, last completed NCCL work: 10. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600046 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa28fe86897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa29115fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa291164a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa291165dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7fa2dcbfee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7fa2e1c45609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7fa2e1a10353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=11, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600046 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa28fe86897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa29115fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa291164a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa291165dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7fa2dcbfee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7fa2e1c45609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7fa2e1a10353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa28fe86897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7fa290de9119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7fa2dcbfee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 
(0x7fa2e1c45609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7fa2e1a10353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default2]:[rank58]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 26] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:577] [Rank 26] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:583] [Rank 26] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 26] Process group watchdog thread terminated with exception: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7eff9ec84897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7eff9ff5dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7eff9ff62a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7eff9ff63dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7effeb9fce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7efff0a43609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7efff080e353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 2 Rank 26] Process group watchdog thread terminated with exception: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7eff9ec84897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7eff9ff5dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7eff9ff62a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7eff9ff63dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7effeb9fce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7efff0a43609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7efff080e353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7eff9ec84897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7eff9fbe7119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7effeb9fce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7efff0a43609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7efff080e353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default6]:[rank62]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 30] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:577] [Rank 30] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:583] [Rank 30] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 30] Process group watchdog thread terminated with exception: [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05293d9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f052a6b2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f052a6b7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f052a6b8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f0576151e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f057b198609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f057af63353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 2 Rank 30] Process group watchdog thread terminated with exception: [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05293d9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f052a6b2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f052a6b7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f052a6b8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f0576151e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f057b198609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f057af63353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05293d9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f052a33c119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f0576151e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default6]:frame #3: + 0x8609 (0x7f057b198609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f057af63353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default7]:[rank63]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 31] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:577] [Rank 31] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:583] [Rank 31] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f00d7a10897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f00d8ce9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f00d8ceea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f00d8cefdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f0124788e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f01297cf609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f012959a353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 2 Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f00d7a10897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f00d8ce9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f00d8ceea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f00d8cefdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f0124788e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f01297cf609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f012959a353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f00d7a10897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f00d8973119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f0124788e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f01297cf609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f012959a353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default3]:[rank59]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 27] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:577] [Rank 27] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:583] [Rank 27] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 27] Process group watchdog thread terminated with exception: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe0a363d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fe0a4916c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fe0a491ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe0a491cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7fe0f03b5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7fe0f53fc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7fe0f51c7353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 2 Rank 27] Process group watchdog thread terminated with exception: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 24] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe0a363d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fe0a4916c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fe0a491ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe0a491cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]:[E ProcessGroupNCCL.cpp:577] [Rank 24] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:frame #4: + 0xd3e95 (0x7fe0f03b5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:[rank56]:[E ProcessGroupNCCL.cpp:583] [Rank 24] To avoid data inconsistency, we are taking the entire process down. 
[default3]:frame #5: + 0x8609 (0x7fe0f53fc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 24] Process group watchdog thread terminated with exception: [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600024 milliseconds before timing out. [default3]:frame #6: clone + 0x43 (0x7fe0f51c7353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]: [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f678b58c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f678c865c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f678c86aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f678c86bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f67d8304e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f67dd34b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f67dd116353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 2 Rank 24] Process group watchdog thread terminated with exception: [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600024 milliseconds before timing out. 
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe0a363d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7fe0a45a0119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7fe0f03b5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7fe0f53fc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7fe0f51c7353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f678b58c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f678c865c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f678c86aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f678c86bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f67d8304e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f67dd34b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f67dd116353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f678b58c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f678c4ef119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f67d8304e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f67dd34b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f67dd116353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default5]:[rank61]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 29] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:577] [Rank 29] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 25] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:583] [Rank 29] To avoid data inconsistency, we are taking the entire process down. 
[default5]:[rank61]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 29] Process group watchdog thread terminated with exception: [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:577] [Rank 25] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:583] [Rank 25] To avoid data inconsistency, we are taking the entire process down. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe061033897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fe06230cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank57]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 25] Process group watchdog thread terminated with exception: [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fe062311a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe062312dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fe0addabe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6ca8039897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #5: + 0x8609 (0x7fe0b2df2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fe0b2bbd353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6ca9312c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6ca9317a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6ca9318dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]:frame #4: + 0xd3e95 (0x7f6cf4db1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]: what(): [PG 2 Rank 29] Process group watchdog thread terminated with exception: [Rank 29] Watchdog caught 
collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe061033897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fe06230cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #5: + 0x8609 (0x7f6cf9df8609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f6cf9bc3353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 2 Rank 25] Process group watchdog thread terminated with exception: [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fe062311a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe062312dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6ca8039897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6ca9312c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6ca9317a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6ca9318dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f6cf4db1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f6cf9df8609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f6cf9bc3353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default5]:frame #4: + 0xd3e95 (0x7fe0addabe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fe0b2df2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6ca8039897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #6: clone + 0x43 
(0x7fe0b2bbd353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default1]:frame #1: + 0xe32119 (0x7f6ca8f9c119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe061033897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7fe061f96119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f6cf4db1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #2: + 0xd3e95 (0x7fe0addabe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f6cf9df8609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f6cf9bc3353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default5]:frame #3: + 0x8609 (0x7fe0b2df2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7fe0b2bbd353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default4]:[rank60]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 28] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:577] [Rank 28] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:583] [Rank 28] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 28] Process group watchdog thread terminated with exception: [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f88d5ab1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f88d6d8ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f88d6d8fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f88d6d90dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f8922829e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f8927870609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f892763b353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 2 Rank 28] Process group watchdog thread terminated with exception: [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f88d5ab1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f88d6d8ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f88d6d8fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f88d6d90dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f8922829e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f8927870609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f892763b353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f88d5ab1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f88d6a14119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f8922829e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default4]:frame #3: + 0x8609 (0x7f8927870609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f892763b353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default0]:[rank48]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 16] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:577] [Rank 16] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:583] [Rank 16] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0b996cc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank53]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 21] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:577] [Rank 21] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:583] [Rank 21] To avoid data inconsistency, we are taking the entire process down. [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0b9a9a5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank53]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 21] Process group watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600021 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0b9a9aaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3784459897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3785732c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3785737a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3785738dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f37d11d1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f37d6218609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f37d5fe3353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 2 Rank 21] Process group watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600021 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3784459897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3785732c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3785737a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3785738dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f37d11d1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f37d6218609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f37d5fe3353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0b9a9abdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f0be6444e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #5: + 0x8609 (0x7f0beb48b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3784459897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #6: clone + 0x43 (0x7f0beb256353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]:frame #1: + 0xe32119 (0x7f37853bc119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f37d11d1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]:frame #3: + 0x8609 (0x7f37d6218609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f37d5fe3353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default0]: what(): [PG 2 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0b996cc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0b9a9a5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0b9a9aaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0b9a9abdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f0be6444e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f0beb48b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f0beb256353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0b996cc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f0b9a62f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f0be6444e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f0beb48b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f0beb256353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default6]:[rank54]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 22] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:577] [Rank 22] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:583] [Rank 22] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 22] Process group watchdog thread terminated with exception: [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600017 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b83eab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7b85184c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7b85189a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7b8518adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f7bd0c23e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f7bd5c6a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f7bd5a35353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 2 Rank 22] Process group watchdog thread terminated with exception: [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600017 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b83eab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7b85184c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7b85189a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7b8518adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f7bd0c23e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f7bd5c6a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f7bd5a35353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b83eab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f7b84e0e119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f7bd0c23e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default6]:frame #3: + 0x8609 (0x7f7bd5c6a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f7bd5a35353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default7]:[rank55]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 23] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:577] [Rank 23] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:583] [Rank 23] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 23] Process group watchdog thread terminated with exception: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5314eba897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5316193c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5316198a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5316199dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f5361c32e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f5366c79609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f5366a44353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 2 Rank 23] Process group watchdog thread terminated with exception: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5314eba897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5316193c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5316198a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5316199dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f5361c32e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f5366c79609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f5366a44353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5314eba897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f5315e1d119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f5361c32e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f5366c79609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f5366a44353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default1]:[rank49]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 17] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:577] [Rank 17] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:583] [Rank 17] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 17] Process group watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600025 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff152586897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff15385fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff153864a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff153865dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7ff19f2fee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7ff1a4345609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7ff1a4110353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 2 Rank 17] Process group watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600025 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff152586897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff15385fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff153864a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff153865dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7ff19f2fee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7ff1a4345609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7ff1a4110353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff152586897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7ff1534e9119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7ff19f2fee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default1]:frame #3: + 0x8609 (0x7ff1a4345609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7ff1a4110353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default2]:[rank50]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 18] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:577] [Rank 18] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:583] [Rank 18] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 18] Process group watchdog thread terminated with exception: [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f190e6f6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f190f9cfc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f190f9d4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f190f9d5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f195b46ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f19604b5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f1960280353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 2 Rank 18] Process group watchdog thread terminated with exception: [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f190e6f6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f190f9cfc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f190f9d4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f190f9d5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f195b46ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f19604b5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f1960280353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f190e6f6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f190f659119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f195b46ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f19604b5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f1960280353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default4]:[rank52]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 20] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:577] [Rank 20] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:583] [Rank 20] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 20] Process group watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6d7a044897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6d7b31dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6d7b322a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6d7b323dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f6dc6dbce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f6dcbe03609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f6dcbbce353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 2 Rank 20] Process group watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6d7a044897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6d7b31dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6d7b322a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6d7b323dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f6dc6dbce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f6dcbe03609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f6dcbbce353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6d7a044897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f6d7afa7119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f6dc6dbce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default4]:frame #3: + 0x8609 (0x7f6dcbe03609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f6dcbbce353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default1]:[rank41]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 9] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default1]:[rank41]:[E ProcessGroupNCCL.cpp:577] [Rank 9] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank41]:[E ProcessGroupNCCL.cpp:583] [Rank 9] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank41]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600052 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5f1cccc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5f1dfa5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5f1dfaaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5f1dfabdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f5f69a44e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f5f6ea8b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f5f6e856353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 2 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600052 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5f1cccc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5f1dfa5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5f1dfaaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5f1dfabdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f5f69a44e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f5f6ea8b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f5f6e856353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5f1cccc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f5f1dc2f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f5f69a44e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f5f6ea8b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f5f6e856353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 12] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:577] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:583] [Rank 12] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 11] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:577] [Rank 11] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:583] [Rank 11] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 11] Process group watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600032 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4cf76a3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4cf897cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4cf8981a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4cf8982dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f4d4441be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600074 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fec292ad897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 15] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default3]:frame #5: + 0x8609 (0x7f4d49462609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fec2a586c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #6: clone + 0x43 (0x7f4d4922d353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]:[rank47]:[E ProcessGroupNCCL.cpp:577] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fec2a58ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fec2a58cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]:[E ProcessGroupNCCL.cpp:583] [Rank 15] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. 
[default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]: what(): [PG 2 Rank 11] Process group watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600032 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #4: + 0xd3e95 (0x7fec76025e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bd32ee897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #5: + 0x8609 (0x7fec7b06c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4cf76a3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4bd45c7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4bd45cca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4cf897cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4bd45cddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4cf8981a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f4c20066e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f4c250ad609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fec7ae37353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4cf8982dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: [default3]:frame #4: + 0xd3e95 (0x7f4d4441be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f4d49462609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f4c24e78353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]:frame #6: clone + 0x43 (0x7f4d4922d353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default3]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 2 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: 
WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600074 milliseconds before timing out. [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fec292ad897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4cf76a3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]: what(): [PG 2 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fec2a586c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #1: + 0xe32119 (0x7f4cf8606119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fec2a58ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bd32ee897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4bd45c7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fec2a58cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fec76025e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4bd45cca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #5: + 0x8609 (0x7fec7b06c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #2: + 0xd3e95 (0x7f4d4441be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4bd45cddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #6: clone + 0x43 (0x7fec7ae37353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]:frame #3: + 0x8609 (0x7f4d49462609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: + 0xd3e95 (0x7f4c20066e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]: [default7]:frame #5: + 0x8609 
(0x7f4c250ad609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f4d4922d353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default7]:frame #6: clone + 0x43 (0x7f4c24e78353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fec292ad897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7fec2a210119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7fec76025e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7fec7b06c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bd32ee897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f4bd4251119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: clone + 0x43 (0x7fec7ae37353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]:[rank40]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 8] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default7]:frame #2: + 0xd3e95 (0x7f4c20066e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]: [default0]:[rank40]:[E ProcessGroupNCCL.cpp:577] [Rank 8] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:frame #3: + 0x8609 (0x7f4c250ad609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:[rank40]:[E ProcessGroupNCCL.cpp:583] [Rank 8] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 8] Process group watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff1e313e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff1e4417c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff1e441ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff1e441ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7ff22feb6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7ff234efd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f4c24e78353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]:frame #6: clone + 0x43 (0x7ff234cc8353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: [default0]: what(): [PG 2 Rank 8] Process group watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff1e313e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff1e4417c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff1e441ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff1e441ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7ff22feb6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7ff234efd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7ff234cc8353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff1e313e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7ff1e40a1119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 
(0x7ff22feb6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7ff234efd609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7ff234cc8353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 13] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:577] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:583] [Rank 13] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b3aa67897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1b3bd40c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1b3bd45a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1b3bd46dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f1b877dfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f1b8c826609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f1b8c5f1353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 2 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b3aa67897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1b3bd40c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1b3bd45a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1b3bd46dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f1b877dfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f1b8c826609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f1b8c5f1353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b3aa67897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f1b3b9ca119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f1b877dfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f1b8c826609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f1b8c5f1353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 14] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:577] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:583] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600092 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27e9c4a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f27eaf23c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f27eaf28a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f27eaf29dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f28369c2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f283ba09609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f283b7d4353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 2 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600092 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27e9c4a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f27eaf23c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f27eaf28a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f27eaf29dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f28369c2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f283ba09609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f283b7d4353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27e9c4a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f27eabad119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f28369c2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default6]:frame #3: + 0x8609 (0x7f283ba09609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f283b7d4353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default2]:[rank42]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 10] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default2]:[rank42]:[E ProcessGroupNCCL.cpp:577] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank42]:[E ProcessGroupNCCL.cpp:583] [Rank 10] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank42]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 10] Process group watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bb8b25897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4bb9dfec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4bb9e03a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4bb9e04dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f4c0589de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f4c0a8e4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f4c0a6af353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 2 Rank 10] Process group watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bb8b25897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f4bb9dfec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4bb9e03a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4bb9e04dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f4c0589de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f4c0a8e4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f4c0a6af353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bb8b25897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f4bb9a88119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f4c0589de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f4c0a8e4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f4c0a6af353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default3]:[rank51]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 19] Timeout at NCCL work: 69, last enqueued NCCL work: 69, last completed NCCL work: 68. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:577] [Rank 19] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:583] [Rank 19] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 19] Process group watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb863ccc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb864fa5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb864faaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb864fabdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7fb8b0a44e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7fb8b5a8b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7fb8b5856353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 2 Rank 19] Process group watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=268435456, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb863ccc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb864fa5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb864faaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb864fabdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7fb8b0a44e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7fb8b5a8b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7fb8b5856353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb863ccc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7fb864c2f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7fb8b0a44e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) 
[default3]:frame #3: + 0x8609 (0x7fb8b5a8b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #4: clone + 0x43 (0x7fb8b5856353 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
W0703 04:19:34.443000 140430287660864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1670730 closing signal SIGTERM
W0703 04:19:34.444000 140430287660864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1670731 closing signal SIGTERM
W0703 04:19:34.444000 140430287660864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1670732 closing signal SIGTERM
W0703 04:19:34.444000 140430287660864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1670733 closing signal SIGTERM
E0703 04:19:35.464000 140430287660864 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 1670726) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_04:19:34
  host      : ip-26-0-162-233.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1670727)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1670727
[2]:
  time      : 2024-07-03_04:19:34
  host      : ip-26-0-162-233.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 1670728)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1670728
[3]:
  time      : 2024-07-03_04:19:34
  host      : ip-26-0-162-233.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 1670729)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1670729
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_04:19:34
  host      : ip-26-0-162-233.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 1670726)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1670726
============================================================
srun: error: ip-26-0-162-233: task 0: Exited with exit code 1
W0703 04:19:38.196000 140120415590144 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-174-36.ec2.internal_846724_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 04:19:38.269000 140269987243776 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_417204_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:38.991000 140423078258432 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-202.ec2.internal_1315236_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.103000 139904079726336 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_906642_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.121000 140706903525120 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_333077_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.149000 140698831050496 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-163-147.ec2.internal_804157_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.183000 140509718718208 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-166-125.ec2.internal_32970_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.405000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315306 closing signal SIGTERM W0703 04:19:39.406000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315307 closing signal SIGTERM W0703 04:19:39.406000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315308 closing signal SIGTERM W0703 04:19:39.406000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315309 closing signal SIGTERM W0703 04:19:39.406000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315310 closing signal SIGTERM W0703 04:19:39.406000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315311 closing signal SIGTERM W0703 04:19:39.406000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315312 closing signal SIGTERM W0703 04:19:39.406000 140428738991936 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1315313 closing signal SIGTERM W0703 04:19:39.403000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33040 closing signal SIGTERM W0703 04:19:39.404000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33041 closing signal SIGTERM W0703 04:19:39.404000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33042 closing signal SIGTERM W0703 04:19:39.404000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33043 closing signal SIGTERM W0703 04:19:39.404000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33044 closing signal SIGTERM W0703 04:19:39.407000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33045 closing signal SIGTERM 
W0703 04:19:39.409000 139909740459840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 906716 closing signal SIGTERM W0703 04:19:39.409000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33046 closing signal SIGTERM W0703 04:19:39.409000 140515379451712 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 33047 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333147 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333148 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333149 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333150 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333151 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333152 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333153 closing signal SIGTERM W0703 04:19:39.420000 140712564258624 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 333154 closing signal SIGTERM W0703 04:19:39.440000 140275647977280 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 417276 closing signal SIGTERM W0703 04:19:39.440000 140275647977280 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 417277 closing signal SIGTERM W0703 04:19:39.440000 140275647977280 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 417280 closing signal SIGTERM W0703 04:19:39.440000 140275647977280 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 417282 closing signal SIGTERM W0703 04:19:39.441000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 804228 closing signal SIGTERM W0703 04:19:39.441000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 804229 closing signal SIGTERM W0703 04:19:39.441000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 804230 closing signal SIGTERM W0703 04:19:39.441000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 804231 closing signal SIGTERM W0703 04:19:39.442000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 804232 closing signal SIGTERM W0703 04:19:39.442000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 804233 closing signal SIGTERM W0703 04:19:39.442000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 804235 closing signal SIGTERM E0703 04:19:39.629000 139909740459840 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 906712) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 04:19:39.641000 139909740459840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_906642_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
E0703 04:19:39.648000 140126076323648 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 846793) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 04:19:39.660000 140126076323648 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_846724_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.671000 139909740459840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_906642_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.694000 140126076323648 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_846724_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:39.697000 139909740459840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_906642_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_04:19:39 host : ip-26-0-165-24.ec2.internal rank : 25 (local_rank: 1) exitcode : -6 (pid: 906713) error_file: traceback : Signal 6 (SIGABRT) received by PID 906713 [2]: time : 2024-07-03_04:19:39 host : ip-26-0-165-24.ec2.internal rank : 26 (local_rank: 2) exitcode : -6 (pid: 906714) error_file: traceback : Signal 6 (SIGABRT) received by PID 906714 [3]: time : 2024-07-03_04:19:39 host : ip-26-0-165-24.ec2.internal rank : 27 (local_rank: 3) exitcode : -6 (pid: 906715) error_file: traceback : Signal 6 (SIGABRT) received by PID 906715 [4]: time : 2024-07-03_04:19:39 host : ip-26-0-165-24.ec2.internal rank : 29 (local_rank: 5) exitcode : -6 (pid: 906717) error_file: traceback : Signal 6 (SIGABRT) received by PID 906717 [5]: time : 2024-07-03_04:19:39 host : ip-26-0-165-24.ec2.internal rank : 30 (local_rank: 6) exitcode : -6 (pid: 906718) error_file: traceback : Signal 6 (SIGABRT) received by PID 906718 [6]: time : 2024-07-03_04:19:39 host : ip-26-0-165-24.ec2.internal rank : 31 (local_rank: 7) exitcode : -6 (pid: 906719) error_file: 
traceback : Signal 6 (SIGABRT) received by PID 906719 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_04:19:39 host : ip-26-0-165-24.ec2.internal rank : 24 (local_rank: 0) exitcode : -6 (pid: 906712) error_file: traceback : Signal 6 (SIGABRT) received by PID 906712 ============================================================ W0703 04:19:39.724000 140126076323648 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_846724_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 57 (local_rank: 1) exitcode : -6 (pid: 846794) error_file: traceback : Signal 6 (SIGABRT) received by PID 846794 [2]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 58 (local_rank: 2) exitcode : -6 (pid: 846795) error_file: traceback : Signal 6 (SIGABRT) received by PID 846795 [3]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 59 (local_rank: 3) exitcode : -6 (pid: 846796) error_file: traceback : Signal 6 (SIGABRT) received by PID 846796 [4]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 60 (local_rank: 4) exitcode : -6 (pid: 846797) error_file: traceback : Signal 6 (SIGABRT) received by PID 846797 [5]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 61 (local_rank: 5) exitcode : -6 (pid: 846798) error_file: traceback : Signal 6 (SIGABRT) received by PID 846798 [6]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 62 (local_rank: 6) exitcode : -6 (pid: 846799) error_file: traceback : Signal 6 (SIGABRT) received by PID 846799 [7]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 63 (local_rank: 7) exitcode : -6 (pid: 846800) error_file: traceback : Signal 6 (SIGABRT) received by PID 846800 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_04:19:39 host : ip-26-0-174-36.ec2.internal rank : 56 (local_rank: 0) exitcode : -6 (pid: 846793) error_file: traceback : Signal 6 (SIGABRT) received by PID 846793 
============================================================ srun: error: ip-26-0-174-36: task 7: Exited with exit code 1 srun: error: ip-26-0-165-24: task 3: Exited with exit code 1 E0703 04:19:40.849000 140275647977280 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 417275) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 04:19:40.862000 140275647977280 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_417204_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:40.893000 140275647977280 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_417204_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:40.910000 140275647977280 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_417204_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_04:19:39 host : ip-26-0-164-207.ec2.internal rank : 19 (local_rank: 3) exitcode : -6 (pid: 417278) error_file: traceback : Signal 6 (SIGABRT) received by PID 417278 [2]: time : 2024-07-03_04:19:39 host : ip-26-0-164-207.ec2.internal rank : 20 (local_rank: 4) exitcode : -6 (pid: 417279) error_file: traceback : Signal 6 (SIGABRT) received by PID 417279 [3]: time : 2024-07-03_04:19:39 host : ip-26-0-164-207.ec2.internal rank : 22 (local_rank: 6) exitcode : -6 (pid: 417281) error_file: traceback : Signal 6 (SIGABRT) received by PID 417281 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_04:19:39 host : ip-26-0-164-207.ec2.internal rank : 16 (local_rank: 0) exitcode : -6 (pid: 417275) error_file: traceback : Signal 6 (SIGABRT) received by PID 417275 ============================================================ E0703 04:19:41.570000 140704491784000 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 6 (pid: 804234) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 
W0703 04:19:41.582000 140704491784000 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_804157_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:41.615000 140704491784000 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_804157_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:41.625000 140704491784000 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_804157_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_04:19:39 host : ip-26-0-163-147.ec2.internal rank : 14 (local_rank: 6) exitcode : -6 (pid: 804234) error_file: traceback : Signal 6 (SIGABRT) received by PID 804234 ============================================================ srun: error: ip-26-0-164-207: task 2: Exited with exit code 1 srun: error: ip-26-0-163-147: task 1: Exited with exit code 1 W0703 04:19:42.936000 140712564258624 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_333077_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:42.954000 140712564258624 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_333077_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
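Every worker listed in these failure summaries exits with exitcode -6: the watchdog's c10::DistBackendError reaches terminate (as shown in the earlier trace), the process aborts, and torchrun records the resulting Signal 6 (SIGABRT). A one-line illustration of that exit-code/signal mapping, for reference only:

    # Illustrative only: the negative exit codes torchrun reports correspond to the terminating signal.
    import signal

    print(signal.Signals(6).name)  # prints 'SIGABRT', reported above as "exitcode : -6"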
Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. W0703 04:19:43.962000 140428738991936 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-202.ec2.internal_1315236_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:43.978000 140428738991936 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-202.ec2.internal_1315236_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
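The RendezvousConnectionError / "Broken pipe" messages that repeat from here on come from the elastic agents on the surviving nodes: each agent keeps reading the rendezvous state and sending keep-alive heartbeats through the c10d store, and once the agent process hosting that store has exited, every store operation fails. A rough sketch of that client-side access pattern, with a placeholder host, port, and key (the real backend manages these internally); it is shown only to illustrate why each surviving node keeps logging the same error until its agent shuts down:

    # Rough illustration only; the host, port, and key below are placeholders, not values from this job.
    # The elastic agent holds a client connection to a TCPStore like this and periodically reads the
    # rendezvous state; when the process hosting the store exits, these operations fail and torchrun
    # surfaces the failure as RendezvousConnectionError.
    from datetime import timedelta

    from torch.distributed import TCPStore

    store = TCPStore("rendezvous-host.example", 29500, 1, False, timedelta(seconds=30))
    state = store.get("rendezvous_state")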
srun: error: ip-26-0-173-246: task 6: Exited with exit code 1 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. W0703 04:19:44.188000 140509718718208 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-166-125.ec2.internal_32970_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. 
srun: error: ip-26-0-173-202: task 5: Exited with exit code 1 W0703 04:19:49.192000 140509718718208 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-166-125.ec2.internal_32970_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:51.756000 140515379451712 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-166-125.ec2.internal_32970_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 04:19:51.772000 140515379451712 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-166-125.ec2.internal_32970_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-166-125: task 4: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.