======================== START TIME: Tue Jul 2 23:26:16 UTC 2024 python3 version = Python 3.10.14 ======================== The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. Token is valid (permission: write). Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token Login successful Already on 'bench_cluster' M examples/config_tiny_llama.py M examples/config_tiny_llama.yaml M examples/train_tiny_llama.sh M src/nanotron/models/llama.py M src/nanotron/trainer.py Your branch is up to date with 'origin/bench_cluster'. Job status: RUNNING W0702 23:26:21.655000 139850683533120 torch/distributed/run.py:757] W0702 23:26:21.655000 139850683533120 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.655000 139850683533120 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:21.655000 139850683533120 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.752000 139855107929920 torch/distributed/run.py:757] W0702 23:26:21.752000 139855107929920 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.752000 139855107929920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:21.752000 139855107929920 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.886000 140580253726528 torch/distributed/run.py:757] W0702 23:26:21.886000 140580253726528 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.886000 140580253726528 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:21.886000 140580253726528 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.886000 139668319082304 torch/distributed/run.py:757] W0702 23:26:21.886000 139668319082304 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.886000 139668319082304 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:21.886000 139668319082304 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.980000 140582428587840 torch/distributed/run.py:757] W0702 23:26:21.980000 140582428587840 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.980000 140582428587840 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:21.980000 140582428587840 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.986000 140421053228864 torch/distributed/run.py:757] W0702 23:26:21.986000 140421053228864 torch/distributed/run.py:757] ***************************************** W0702 23:26:21.986000 140421053228864 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:21.986000 140421053228864 torch/distributed/run.py:757] ***************************************** W0702 23:26:22.050000 140357880190784 torch/distributed/run.py:757] W0702 23:26:22.050000 140357880190784 torch/distributed/run.py:757] ***************************************** W0702 23:26:22.050000 140357880190784 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:22.050000 140357880190784 torch/distributed/run.py:757] ***************************************** W0702 23:26:22.123000 139704473855808 torch/distributed/run.py:757] W0702 23:26:22.123000 139704473855808 torch/distributed/run.py:757] ***************************************** W0702 23:26:22.123000 139704473855808 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 23:26:22.123000 139704473855808 torch/distributed/run.py:757] ***************************************** [default0]:07/02/2024 23:26:48 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Vocab Size Padding] Padded vocab (size: 50257) with 15 dummy tokens (new size: 50272) [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Config: [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: run='%date_%jobid', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: seed=42, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: step=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: consumed_train_samples=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: benchmark_csv_path=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: ignore_sanity_checks=True), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: parallelism=ParallelismArgs(dp=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pp=4, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp=16, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pp_engine=, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp_mode=, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp_linear_async_communication=False, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: expert_parallel_size=1), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: eos_token_id=2, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_act='silu', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_size=2048, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: initializer_range=0.02, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: intermediate_size=4096, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: is_llama_config=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: max_position_embeddings=4096, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_attention_heads=32, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_hidden_layers=24, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_key_value_heads=32, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pad_token_id=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pretraining_tp=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rms_norm_eps=1e-05, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_scaling=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_theta=10000.0, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tie_word_embeddings=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: use_cache=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: vocab_size=50272), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: init_method=RandomInit(std=0.025), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dtype=torch.bfloat16, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: make_vocab_size_divisible_by=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: ddp_bucket_cap_mb=25), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer_revision=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer_max_length=None), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoint_interval=100000, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: save_initial_state=False, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: resume_checkpoint_path=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoints_path_is_shared_file_system=False), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: logging=LoggingArgs(log_level='info', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: log_level_replica='info', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: iteration_step_info_interval=1), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokens=TokensArgs(sequence_length=4096, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: train_steps=20, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: micro_batch_size=4, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: batch_accumulation_per_replica=256, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: val_check_interval=-1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: limit_val_batches=0, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: limit_test_batches=0), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: adam_beta1=0.9, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: adam_beta2=0.95, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: torch_adam_is_fused=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: name='adamW'), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: zero_stage=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: weight_decay=0.01, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: clip_grad=1.0, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: accumulate_grad_in_fp32=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_warmup_steps=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_warmup_style='linear', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_style='linear', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_steps=19, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_starting_step=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: min_decay_lr=1e-05)), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: start_training_step=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hf_dataset_splits='train', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hf_dataset_config_name=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dataset_processing_num_proc_per_process=64, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dataset_overwrite_cache=False, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: text_column_name='text'), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: seed=42, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_loading_workers=0))], [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-16_pp-4_mbz-4')), [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lighteval=None) [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Model Config: [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: LlamaConfig(bos_token_id=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: eos_token_id=2, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_act='silu', [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_size=2048, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: initializer_range=0.02, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: intermediate_size=4096, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: is_llama_config=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: max_position_embeddings=4096, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_attention_heads=32, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_hidden_layers=24, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_key_value_heads=32, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pad_token_id=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pretraining_tp=1, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rms_norm_eps=1e-05, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_scaling=None, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_theta=10000.0, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tie_word_embeddings=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: use_cache=True, [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: vocab_size=50272) [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Building model.. [default0]:07/02/2024 23:26:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Setting PP block ranks... [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=2|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=2|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=2|ip-26-0-162-233]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=4|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=4|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=4|ip-26-0-162-233]: No checkpoint path provided. [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=6|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=6|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=6|ip-26-0-162-233]: No checkpoint path provided. [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=2|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=2|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=2|ip-26-0-160-225]: No checkpoint path provided. [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=3|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=3|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=3|ip-26-0-160-225]: No checkpoint path provided. [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=14|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=14|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=14|ip-26-0-161-78]: No checkpoint path provided. [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=11|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=11|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=9|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=11|ip-26-0-161-78]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=9|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=9|ip-26-0-161-78]: No checkpoint path provided. [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=14|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=9|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=0|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=0|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=0|ip-26-0-162-233]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=7|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=7|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=7|ip-26-0-160-225]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=9|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=14|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=14|ip-26-0-171-102]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=1|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=1|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=9|ip-26-0-171-102]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=1|ip-26-0-162-233]: No checkpoint path provided. [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=11|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=11|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=11|ip-26-0-171-102]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=12|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=12|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=10|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=10|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=10|ip-26-0-171-102]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=12|ip-26-0-171-102]: No checkpoint path provided. [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=13|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=8|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=13|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=13|ip-26-0-171-102]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=15|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=15|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=8|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=8|ip-26-0-161-78]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=15|ip-26-0-171-102]: No checkpoint path provided. [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=8|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=3|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=8|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=8|ip-26-0-171-102]: No checkpoint path provided. [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=3|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=3|ip-26-0-162-233]: No checkpoint path provided. [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=5|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=5|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=5|ip-26-0-160-225]: No checkpoint path provided. [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=5|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=5|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=5|ip-26-0-162-233]: No checkpoint path provided. [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=10|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=10|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=10|ip-26-0-161-78]: No checkpoint path provided. [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=10|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=10|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=10|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=7|ip-26-0-162-233]: Local number of parameters: 15.8M (30.05MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=7|ip-26-0-162-233]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=2|TP=7|ip-26-0-162-233]: No checkpoint path provided. [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Total number of parameters: 1.21G (2315.81MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Parametrizing model parameters using StandardParametrizator [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=14|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=14|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=14|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=8|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=8|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=8|ip-26-0-161-103]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=9|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=15|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=15|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=9|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=9|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=15|ip-26-0-171-88]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=12|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=13|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=4|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=4|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=13|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=13|ip-26-0-161-78]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=4|ip-26-0-160-225]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=12|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=12|ip-26-0-161-78]: No checkpoint path provided. [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=13|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-153]: No checkpoint path provided. [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=13|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=13|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-153]: No checkpoint path provided. [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=6|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=6|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=6|ip-26-0-161-153]: No checkpoint path provided. [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=6|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=6|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=6|ip-26-0-160-225]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=12|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=8|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=8|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=8|ip-26-0-171-88]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=12|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=11|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=11|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=12|ip-26-0-161-103]: No checkpoint path provided. [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=11|ip-26-0-161-103]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=9|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=9|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=10|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=10|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=10|ip-26-0-161-103]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=9|ip-26-0-161-103]: No checkpoint path provided. [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-153]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=4|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=4|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=4|ip-26-0-161-153]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-153]: No checkpoint path provided. [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=5|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=5|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=5|ip-26-0-161-153]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=7|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=7|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=15|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=7|ip-26-0-161-153]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=15|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=1|TP=15|ip-26-0-161-78]: No checkpoint path provided. [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default0]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-62]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-225]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-225]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-225]: No checkpoint path provided. [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=6|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=6|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=6|ip-26-0-171-62]: No checkpoint path provided. [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=3|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=3|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=3|ip-26-0-171-62]: No checkpoint path provided. [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=13|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=13|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=13|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=7|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=7|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=7|ip-26-0-171-62]: No checkpoint path provided. [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=11|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=11|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default3]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=11|ip-26-0-171-88]: No checkpoint path provided. [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=14|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=14|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default6]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=14|ip-26-0-171-88]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=12|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=12|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=12|ip-26-0-171-88]: No checkpoint path provided. [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=2|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=2|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default2]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=2|ip-26-0-171-62]: No checkpoint path provided. [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=5|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=5|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default5]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=5|ip-26-0-171-62]: No checkpoint path provided. [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=4|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=4|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default4]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=4|ip-26-0-171-62]: No checkpoint path provided. [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default1]:07/02/2024 23:27:06 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-62]: No checkpoint path provided. [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=15|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=15|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default7]:07/02/2024 23:27:06 [INFO|DP=0|PP=0|TP=15|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/02/2024 23:27:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/02/2024 23:27:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/02/2024 23:27:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 0 has 24.8M out of 24.8M (100.00%) params' optimizer states [default0]:07/02/2024 23:27:10 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/02/2024 23:27:10 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Using `datasets` library [default0]:07/02/2024 23:27:10 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:10 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Training Plan] There are 1 training stages [default0]:07/02/2024 23:27:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Stage Training Stage] start from step 1 [default0]:07/02/2024 23:27:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [default0]:07/02/2024 23:27:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Start training] datetime: 2024-07-02 23:27:12.188738 | mbs: 4 | grad_accum: 256 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/02/2024 23:27:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/02/2024 23:27:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Memory usage: 244.38MiB. Peak allocated 244.38MiB. Peak reserved: 266.00MiB [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=2|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=4|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=2|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=3|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=9|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=1|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=11|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=11|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=0|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=14|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=9|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=10|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=12|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=13|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=8|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=5|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=8|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=10|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=10|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=7|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=8|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=14|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=4|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=6|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=15|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=9|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=13|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=13|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=8|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=0|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=6|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=9|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=3|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=10|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=2|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=5|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=15|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=7|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=1|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=1|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=7|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=6|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=3|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=12|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=2|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=4|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=6|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=14|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=7|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=15|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=5|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=12|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=1|TP=4|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=12|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=11|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=13|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=14|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=11|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 23:27:12 [WARNING|DP=0|PP=3|TP=5|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 23:27:12 [WARNING|DP=0|PP=0|TP=15|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 23:27:12 [WARNING|DP=0|PP=2|TP=3|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank24]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [default6]:[rank22]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default3]:[rank19]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600014 milliseconds before timing out. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [default3]:[rank27]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. [default2]:[rank26]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. [default6]:[rank30]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600022 milliseconds before timing out. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank9]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank9]: grad_accumulator.backward(sum(activations)) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank9]: result = loss.backward() [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank9]: torch.autograd.backward( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank9]: _engine_run_backward( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank9]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank9]: return user_fn(self, *args) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank9]: pipeline_state.run_communication() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank9]: send_activation() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank9]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank9]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank9]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank9]: dist.send( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank9]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank9]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank6]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default6]:[rank6]: grad_accumulator.backward(sum(activations)) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default6]:[rank6]: result = loss.backward() [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default6]:[rank6]: torch.autograd.backward( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default6]:[rank6]: _engine_run_backward( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default6]:[rank6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default6]:[rank6]: return user_fn(self, *args) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default6]:[rank6]: pipeline_state.run_communication() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default6]:[rank6]: send_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default6]:[rank6]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default6]:[rank6]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default6]:[rank6]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default6]:[rank6]: dist.send( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank6]: return func(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default6]:[rank6]: group.send([tensor], group_dst_rank, tag).wait() [default6]:[rank6]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: trainer.train(dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank11]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank11]: grad_accumulator.backward(sum(activations)) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank11]: result = loss.backward() [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank11]: torch.autograd.backward( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank11]: _engine_run_backward( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank11]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank11]: return user_fn(self, *args) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank11]: pipeline_state.run_communication() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:[rank11]: send_activation() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank11]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:[rank11]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:[rank11]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default3]:[rank11]: dist.send( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank11]: return func(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default3]:[rank11]: group.send([tensor], group_dst_rank, tag).wait() [default3]:[rank11]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank5]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank5]: grad_accumulator.backward(sum(activations)) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank5]: result = loss.backward() [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank5]: torch.autograd.backward( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank5]: _engine_run_backward( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank5]: return user_fn(self, *args) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank5]: pipeline_state.run_communication() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default5]:[rank5]: send_activation() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank5]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank5]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default5]:[rank5]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default5]:[rank5]: dist.send( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank5]: return func(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank5]: group.send([tensor], group_dst_rank, tag).wait() [default5]:[rank5]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank17]: Traceback (most recent call last): [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank17]: trainer.train(dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank17]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank17]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank17]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank17]: grad_accumulator.backward(sum(activations)) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank17]: result = loss.backward() [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank17]: torch.autograd.backward( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank17]: _engine_run_backward( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank17]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank17]: return user_fn(self, *args) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank17]: pipeline_state.run_communication() [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank17]: send_activation() [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank17]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank17]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank17]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank17]: dist.send( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank17]: return func(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank17]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank17]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default3]:[rank19]: Traceback (most recent call last): [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank19]: trainer.train(dataloader) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank19]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank19]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank19]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank19]: grad_accumulator.backward(sum(activations)) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank19]: result = loss.backward() [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank19]: torch.autograd.backward( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank19]: _engine_run_backward( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank19]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank19]: return user_fn(self, *args) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank19]: pipeline_state.run_communication() [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:[rank19]: send_activation() [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank19]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:[rank19]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:[rank19]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default3]:[rank19]: dist.send( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank19]: return func(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default3]:[rank19]: group.send([tensor], group_dst_rank, tag).wait() [default3]:[rank19]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default3]:[rank27]: Traceback (most recent call last): [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank27]: trainer.train(dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank27]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank27]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank27]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank27]: grad_accumulator.backward(sum(activations)) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank27]: result = loss.backward() [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank27]: torch.autograd.backward( [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank27]: _engine_run_backward( [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank27]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank27]: return user_fn(self, *args) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank27]: pipeline_state.run_communication() [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:[rank27]: send_activation() [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank27]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:[rank27]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:[rank27]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default3]:[rank27]: dist.send( [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank27]: return func(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default3]:[rank27]: group.send([tensor], group_dst_rank, tag).wait() [default3]:[rank27]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank25]: Traceback (most recent call last): [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank25]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank25]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank25]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank25]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank25]: grad_accumulator.backward(sum(activations)) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank25]: result = loss.backward() [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank25]: torch.autograd.backward( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank25]: _engine_run_backward( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank25]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank25]: return user_fn(self, *args) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank25]: pipeline_state.run_communication() [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank25]: send_activation() [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank25]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank25]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank25]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank25]: dist.send( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank25]: return func(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank25]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank25]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank0]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank0]: grad_accumulator.backward(sum(activations)) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank0]: result = loss.backward() [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank0]: torch.autograd.backward( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank0]: _engine_run_backward( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank0]: return user_fn(self, *args) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank0]: pipeline_state.run_communication() [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank0]: send_activation() [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default0]:[rank0]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default0]:[rank0]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default0]:[rank0]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank0]: dist.send( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank0]: return func(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default0]:[rank0]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank0]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default7]:[rank23]: Traceback (most recent call last): [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank23]: trainer.train(dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank23]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank23]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank23]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank23]: grad_accumulator.backward(sum(activations)) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank23]: result = loss.backward() [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank23]: torch.autograd.backward( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]:[rank23]: _engine_run_backward( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank23]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:[rank23]: return user_fn(self, *args) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank23]: pipeline_state.run_communication() [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default7]:[rank23]: send_activation() [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default7]:[rank23]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default7]:[rank23]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default7]:[rank23]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default7]:[rank23]: dist.send( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank23]: return func(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default7]:[rank23]: group.send([tensor], group_dst_rank, tag).wait() [default7]:[rank23]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank1]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank1]: grad_accumulator.backward(sum(activations)) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank1]: result = loss.backward() [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank1]: torch.autograd.backward( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank1]: _engine_run_backward( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank1]: return user_fn(self, *args) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank1]: pipeline_state.run_communication() [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank1]: send_activation() [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank1]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank1]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank1]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank1]: dist.send( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank1]: return func(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank1]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank1]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank21]: Traceback (most recent call last): [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank21]: trainer.train(dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank21]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank21]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank21]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank21]: grad_accumulator.backward(sum(activations)) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank21]: result = loss.backward() [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank21]: torch.autograd.backward( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank21]: _engine_run_backward( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank21]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank21]: return user_fn(self, *args) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank21]: pipeline_state.run_communication() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default5]:[rank21]: send_activation() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank21]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank21]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default5]:[rank21]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default5]:[rank21]: dist.send( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank21]: return func(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank21]: group.send([tensor], group_dst_rank, tag).wait() [default5]:[rank21]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank24]: Traceback (most recent call last): [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank24]: trainer.train(dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank24]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank24]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank24]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank24]: grad_accumulator.backward(sum(activations)) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank24]: result = loss.backward() [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank24]: torch.autograd.backward( [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank24]: _engine_run_backward( [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank24]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank24]: return user_fn(self, *args) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank24]: pipeline_state.run_communication() [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank24]: send_activation() [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default0]:[rank24]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default0]:[rank24]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default0]:[rank24]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank24]: dist.send( [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank24]: return func(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default0]:[rank24]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank24]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600022 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f09c05897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3f0aedec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3f0aee3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3f0aee4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f3f5697de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f3f5b9c4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f3f5b78f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600022 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f09c05897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3f0aedec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3f0aee3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3f0aee4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f3f5697de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f3f5b9c4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f3f5b78f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f09c05897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f3f0ab68119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f3f5697de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f3f5b9c4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f3f5b78f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank8]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank8]: grad_accumulator.backward(sum(activations)) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank8]: result = loss.backward() [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank8]: torch.autograd.backward( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank8]: _engine_run_backward( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank8]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank8]: return user_fn(self, *args) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank8]: pipeline_state.run_communication() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank8]: send_activation() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default0]:[rank8]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default0]:[rank8]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default0]:[rank8]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank8]: dist.send( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank8]: return func(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default0]:[rank8]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank8]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f7248f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3f73768c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3f7376da80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3f7376edcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f3fbf207e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f3fc424e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f3fc4019353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f7248f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3f73768c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3f7376da80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3f7376edcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f3fbf207e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f3fc424e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f3fc4019353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f7248f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f3f733f2119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f3fbf207e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f3fc424e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f3fc4019353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank15]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank15]: grad_accumulator.backward(sum(activations)) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank15]: result = loss.backward() [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank15]: torch.autograd.backward( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]:[rank15]: _engine_run_backward( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank15]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:[rank15]: return user_fn(self, *args) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank15]: pipeline_state.run_communication() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default7]:[rank15]: send_activation() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default7]:[rank15]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default7]:[rank15]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default7]:[rank15]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default7]:[rank15]: dist.send( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank15]: return func(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default7]:[rank15]: group.send([tensor], group_dst_rank, tag).wait() [default7]:[rank15]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f94fd003897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f94fe2dcc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f94fe2e1a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f94fe2e2dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f9549d7be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f954edc2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f954eb8d353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f94fd003897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f94fe2dcc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f94fe2e1a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f94fe2e2dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f9549d7be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f954edc2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f954eb8d353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f94fd003897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f94fdf66119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f9549d7be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f954edc2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f954eb8d353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default7]:[rank31]: Traceback (most recent call last): [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank31]: trainer.train(dataloader) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank31]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank31]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank31]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank31]: grad_accumulator.backward(sum(activations)) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank31]: result = loss.backward() [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank31]: torch.autograd.backward( [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]:[rank31]: _engine_run_backward( [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank31]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:[rank31]: return user_fn(self, *args) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank31]: pipeline_state.run_communication() [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default7]:[rank31]: send_activation() [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default7]:[rank31]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default7]:[rank31]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default7]:[rank31]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default7]:[rank31]: dist.send( [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank31]: return func(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default7]:[rank31]: group.send([tensor], group_dst_rank, tag).wait() [default7]:[rank31]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank3]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank3]: grad_accumulator.backward(sum(activations)) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank3]: result = loss.backward() [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank3]: torch.autograd.backward( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank3]: _engine_run_backward( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank3]: return user_fn(self, *args) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank3]: pipeline_state.run_communication() [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:[rank3]: send_activation() [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank3]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:[rank3]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:[rank3]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default3]:[rank3]: dist.send( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank3]: return func(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default3]:[rank3]: group.send([tensor], group_dst_rank, tag).wait() [default3]:[rank3]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank4]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank4]: grad_accumulator.backward(sum(activations)) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank4]: result = loss.backward() [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank4]: torch.autograd.backward( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank4]: _engine_run_backward( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank4]: return user_fn(self, *args) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank4]: pipeline_state.run_communication() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default4]:[rank4]: send_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]:[rank4]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default4]:[rank4]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default4]:[rank4]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default4]:[rank4]: dist.send( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank4]: return func(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default4]:[rank4]: group.send([tensor], group_dst_rank, tag).wait() [default4]:[rank4]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank29]: Traceback (most recent call last): [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank29]: trainer.train(dataloader) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank29]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank29]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank29]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank29]: grad_accumulator.backward(sum(activations)) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank29]: result = loss.backward() [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank29]: torch.autograd.backward( [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank29]: _engine_run_backward( [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank29]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank29]: return user_fn(self, *args) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank29]: pipeline_state.run_communication() [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default5]:[rank29]: send_activation() [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank29]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank29]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default5]:[rank29]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default5]:[rank29]: dist.send( [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank29]: return func(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank29]: group.send([tensor], group_dst_rank, tag).wait() [default5]:[rank29]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank10]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank10]: grad_accumulator.backward(sum(activations)) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank10]: result = loss.backward() [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank10]: torch.autograd.backward( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank10]: _engine_run_backward( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank10]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank10]: return user_fn(self, *args) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank10]: pipeline_state.run_communication() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank10]: send_activation() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank10]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank10]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank10]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank10]: dist.send( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank10]: return func(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank10]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank10]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank16]: Traceback (most recent call last): [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank16]: trainer.train(dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank16]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank16]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank16]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank16]: grad_accumulator.backward(sum(activations)) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank16]: result = loss.backward() [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank16]: torch.autograd.backward( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank16]: _engine_run_backward( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank16]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank16]: return user_fn(self, *args) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank16]: pipeline_state.run_communication() [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank16]: send_activation() [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default0]:[rank16]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default0]:[rank16]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default0]:[rank16]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank16]: dist.send( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank16]: return func(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default0]:[rank16]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank16]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa4d89d5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa4d9caec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa4d9cb3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa4d9cb4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fa52574de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fa52a794609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fa52a55f353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa4d89d5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa4d9caec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa4d9cb3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa4d9cb4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fa52574de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fa52a794609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fa52a55f353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa4d89d5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7fa4d9938119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7fa52574de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7fa52a794609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7fa52a55f353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default0]:[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f187c3c9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f187d6a2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f187d6a7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f187d6a8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f18c9141e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f18ce188609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f18cdf53353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f187c3c9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f187d6a2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f187d6a7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f187d6a8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f18c9141e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f18ce188609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f18cdf53353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f187c3c9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f187d32c119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f18c9141e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f18ce188609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f18cdf53353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default1]:[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600014 milliseconds before timing out. [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank7]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank7]: grad_accumulator.backward(sum(activations)) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank7]: result = loss.backward() [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f742b99c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank7]: torch.autograd.backward( [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f742cc75c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f742cc7aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f742cc7bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:frame #4: + 0xd3e95 (0x7f7478714e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f747d75b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:[rank7]: _engine_run_backward( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:frame #6: clone + 0x43 (0x7f747d526353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600014 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f742b99c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f742cc75c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: return user_fn(self, *args) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank7]: pipeline_state.run_communication() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f742cc7aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: send_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default7]:[rank7]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f742cc7bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default7]:[rank7]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:frame #4: + 0xd3e95 (0x7f7478714e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f747d75b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f747d526353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default7]:[rank7]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default7]:[rank7]: dist.send( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f742b99c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank7]: return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:frame #1: + 0xe32119 (0x7f742c8ff119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: group.send([tensor], group_dst_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:frame #2: + 0xd3e95 (0x7f7478714e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f747d75b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f747d526353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default2]:[rank26]: Traceback (most recent call last): [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank30]: Traceback (most recent call last): [default2]:[rank26]: trainer.train(dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank30]: trainer.train(dataloader) [default2]:[rank26]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank26]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank30]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank30]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank30]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default6]:[rank30]: grad_accumulator.backward(sum(activations)) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default6]:[rank30]: result = loss.backward() [default2]:[rank26]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank26]: grad_accumulator.backward(sum(activations)) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank26]: result = loss.backward() [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank26]: torch.autograd.backward( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank26]: _engine_run_backward( [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default6]:[rank30]: torch.autograd.backward( [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default6]:[rank30]: _engine_run_backward( [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default6]:[rank30]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default6]:[rank30]: return user_fn(self, *args) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank26]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank26]: return user_fn(self, *args) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank26]: pipeline_state.run_communication() [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank26]: send_activation() [default6]:[rank30]: pipeline_state.run_communication() [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank26]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default6]:[rank30]: send_activation() [default2]:[rank26]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank26]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank26]: dist.send( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank26]: return func(*args, **kwargs) [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank2]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank2]: grad_accumulator.backward(sum(activations)) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank2]: result = loss.backward() [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank2]: torch.autograd.backward( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank2]: _engine_run_backward( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank2]: return user_fn(self, *args) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank2]: pipeline_state.run_communication() [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank2]: send_activation() [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank2]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank2]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank2]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank2]: dist.send( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank2]: return func(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank2]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank2]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank12]: Traceback (most recent call last): [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: trainer.train(dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank12]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank12]: grad_accumulator.backward(sum(activations)) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank13]: Traceback (most recent call last): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: result = loss.backward() [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank12]: torch.autograd.backward( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank12]: _engine_run_backward( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank13]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank12]: return user_fn(self, *args) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank12]: pipeline_state.run_communication() [default5]:[rank13]: grad_accumulator.backward(sum(activations)) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default4]:[rank12]: send_activation() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank13]: result = loss.backward() [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]:[rank12]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank12]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default4]:[rank12]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default5]:[rank13]: torch.autograd.backward( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank13]: _engine_run_backward( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank12]: dist.send( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank12]: return func(*args, **kwargs) [default5]:[rank13]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank12]: group.send([tensor], group_dst_rank, tag).wait() [default4]:[rank12]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank13]: return user_fn(self, *args) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank13]: pipeline_state.run_communication() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default5]:[rank13]: send_activation() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default5]:[rank13]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default5]:[rank13]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default5]:[rank13]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default5]:[rank13]: dist.send( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank13]: return func(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default5]:[rank13]: group.send([tensor], group_dst_rank, tag).wait() [default5]:[rank13]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default6]:[rank30]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank26]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank26]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default6]:[rank30]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default6]:[rank30]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default6]:[rank30]: dist.send( [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank30]: return func(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default6]:[rank30]: group.send([tensor], group_dst_rank, tag).wait() [default6]:[rank30]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank28]: Traceback (most recent call last): [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank28]: trainer.train(dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank28]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank28]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank28]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank28]: grad_accumulator.backward(sum(activations)) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank28]: result = loss.backward() [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank28]: torch.autograd.backward( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank28]: _engine_run_backward( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank28]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank28]: return user_fn(self, *args) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank28]: pipeline_state.run_communication() [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default4]:[rank28]: send_activation() [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]:[rank28]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default4]:[rank28]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default4]:[rank28]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default4]:[rank28]: dist.send( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank28]: return func(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default4]:[rank28]: group.send([tensor], group_dst_rank, tag).wait() [default4]:[rank28]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default6]:[rank22]: Traceback (most recent call last): [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank22]: trainer.train(dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank22]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank22]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default6]:[rank22]: grad_accumulator.backward(sum(activations)) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default6]:[rank22]: result = loss.backward() [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default6]:[rank22]: torch.autograd.backward( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default6]:[rank22]: _engine_run_backward( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default6]:[rank22]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default6]:[rank22]: return user_fn(self, *args) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default6]:[rank22]: pipeline_state.run_communication() [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default6]:[rank22]: send_activation() [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default6]:[rank22]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default6]:[rank22]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default6]:[rank22]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default6]:[rank22]: dist.send( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank22]: return func(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank18]: Traceback (most recent call last): [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank18]: trainer.train(dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank18]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank18]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank18]: grad_accumulator.backward(sum(activations)) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank18]: result = loss.backward() [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank18]: torch.autograd.backward( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank18]: _engine_run_backward( [default6]:[rank22]: group.send([tensor], group_dst_rank, tag).wait() [default6]:[rank22]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank18]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank18]: return user_fn(self, *args) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank18]: pipeline_state.run_communication() [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank18]: send_activation() [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank18]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank18]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank18]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank18]: dist.send( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank18]: return func(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank18]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank18]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank14]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default6]:[rank14]: grad_accumulator.backward(sum(activations)) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default6]:[rank14]: result = loss.backward() [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default6]:[rank14]: torch.autograd.backward( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default6]:[rank14]: _engine_run_backward( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default6]:[rank14]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default6]:[rank14]: return user_fn(self, *args) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default6]:[rank14]: pipeline_state.run_communication() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default6]:[rank14]: send_activation() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default6]:[rank14]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default6]:[rank14]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default6]:[rank14]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default6]:[rank14]: dist.send( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank14]: return func(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default6]:[rank14]: group.send([tensor], group_dst_rank, tag).wait() [default6]:[rank14]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank20]: Traceback (most recent call last): [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank20]: trainer.train(dataloader) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank20]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank20]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank20]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank20]: grad_accumulator.backward(sum(activations)) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank20]: result = loss.backward() [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank20]: torch.autograd.backward( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank20]: _engine_run_backward( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank20]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank20]: return user_fn(self, *args) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank20]: pipeline_state.run_communication() [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default4]:[rank20]: send_activation() [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default4]:[rank20]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default4]:[rank20]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default4]:[rank20]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default4]:[rank20]: dist.send( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank20]: return func(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default4]:[rank20]: group.send([tensor], group_dst_rank, tag).wait() [default4]:[rank20]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank42]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600005 milliseconds before timing out. [default5]:[rank37]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600004 milliseconds before timing out. [default3]:[rank35]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600012 milliseconds before timing out. [default7]:[rank39]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default1]:[rank33]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. [default1]:[rank41]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600012 milliseconds before timing out. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default7]:[rank47]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600025 milliseconds before timing out. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600016 milliseconds before timing out. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. [default2]:[rank34]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [default4]:[rank36]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. [default6]:[rank38]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb05bada897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb05cdb3c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb05cdb8a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb05cdb9dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7fb0a8852e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7fb0ad899609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7fb0ad664353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb05bada897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb05cdb3c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogH[default2]:[rank58]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600031 milliseconds before timing out. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. andler() + 0x1a0 (0x7fb05cdb8a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb05cdb9dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7fb0a8852e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7fb0ad899609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7fb0ad664353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb05bada897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7fb05ca3d119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7fb0a8852e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7fb0ad899609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7fb0ad664353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default2]:[rank50]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600040 milliseconds before timing out. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffb125d8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ffb138b1c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffb138b6a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffb138b7dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7ffb5f350e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7ffb64397609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7ffb64162353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffb125d8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ffb138b1c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffb138b6a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffb138b7dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7ffb5f350e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7ffb64397609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7ffb64162353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffb125d8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7ffb1353b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7ffb5f350e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7ffb64397609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7ffb64162353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:[rank60]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0dff7f8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0e00ad1c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0e00ad6a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0e00ad7dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f0e4c570e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f0e515b7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f0e51382353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0dff7f8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0e00ad1c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0e00ad6a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0e00ad7dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f0e4c570e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f0e515b7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f0e51382353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0dff7f8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f0e0075b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f0e4c570e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f0e515b7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f0e51382353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55640ab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5565384c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5565389a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f556538adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f55b0e23e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f55b5e6a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f55b5c35353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55640ab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5565384c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5565389a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f556538adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f55b0e23e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f55b5e6a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f55b5c35353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55640ab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f556500e119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f55b0e23e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f55b5e6a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f55b5c35353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default1]:[rank17]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa61dce5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa61efbec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa61efc3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa61efc4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7fa66aa5de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7fa66faa4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7fa66f86f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa61dce5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa61efbec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa61efc3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa61efc4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7fa66aa5de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7fa66faa4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7fa66f86f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa61dce5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7fa61ec48119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7fa66aa5de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7fa66faa4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7fa66f86f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97e726e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f97e8547c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f97e854ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f97e854ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f9833fe6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f983902d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f9838df8353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97e726e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f97e8547c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f97e854ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f97e854ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f9833fe6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f983902d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f9838df8353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97e726e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f97e81d1119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f9833fe6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f983902d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f9838df8353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default3]:[rank19]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default3]:[rank19]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank19]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank19]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48dccb5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f48ddf8ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f48ddf93a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f48ddf94dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f4929a2de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f492ea74609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f492e83f353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48dccb5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f48ddf8ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f48ddf93a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f48ddf94dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f4929a2de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f492ea74609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f492e83f353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48dccb5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f48ddc18119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f4929a2de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f492ea74609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f492e83f353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default2]:[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdf97323897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fdf985fcc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fdf98601a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fdf98602dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fdfe409be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fdfe90e2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fdfe8ead353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdf97323897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fdf985fcc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fdf98601a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fdf98602dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fdfe409be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fdfe90e2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fdfe8ead353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdf97323897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7fdf98286119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7fdfe409be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7fdfe90e2609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7fdfe8ead353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbf50f75897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fbf5224ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbf52253a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbf52254dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fbf9dcede95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fbfa2d34609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fbfa2aff353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbf50f75897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fbf5224ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbf52253a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbf52254dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fbf9dcede95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fbfa2d34609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fbfa2aff353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbf50f75897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7fbf51ed8119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7fbf9dcede95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7fbfa2d34609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7fbfa2aff353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc4d0ee4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc4d21bdc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc4d21c2a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc4d21c3dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fc51dc5ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fc522ca3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fc522a6e353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc4d0ee4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc4d21bdc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc4d21c2a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc4d21c3dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fc51dc5ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fc522ca3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fc522a6e353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc4d0ee4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7fc4d1e47119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7fc51dc5ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7fc522ca3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7fc522a6e353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f41e4fbe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f41e6297c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f41e629ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41e629ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f4231d36e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f4236d7d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f4236b48353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f41e4fbe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f41e6297c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f41e629ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41e629ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f4231d36e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f4236d7d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f4236b48353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f41e4fbe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f41e5f21119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f4231d36e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f4236d7d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f4236b48353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default3]:[rank27]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default3]:[rank27]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank27]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank27]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0012087897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0013360c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0013365a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0013366dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f005edffe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f0063e46609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f0063c11353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0012087897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0013360c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0013365a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0013366dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f005edffe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f0063e46609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f0063c11353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0012087897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f0012fea119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f005edffe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f0063e46609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f0063c11353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb5e4aa8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb5e5d81c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb5e5d86a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb5e5d87dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fb631820e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fb636867609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fb636632353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb5e4aa8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb5e5d81c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb5e5d86a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb5e5d87dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fb631820e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fb636867609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fb636632353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default4]: [default5]:[rank13]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6de5410897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6de66e9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb5e4aa8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6de66eea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6de66efdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #1: + 0xe32119 (0x7fb5e5a0b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7fb631820e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7fb636867609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7fb636632353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]:frame #4: + 0xd3e95 (0x7f6e32188e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]: [default5]:frame #5: + 0x8609 (0x7f6e371cf609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f6e36f9a353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6de5410897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6de66e9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6de66eea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6de66efdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f6e32188e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f6e371cf609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f6e36f9a353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6de5410897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f6de6373119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f6e32188e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f6e371cf609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f6e36f9a353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:[rank61]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 0] Timeout at NCCL work: 22, last enqueued NCCL work: 23, last completed NCCL work: 21. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f29b9269897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f29ba542c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f29ba547a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f29ba548dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f2a05fe1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f2a0b028609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f2a0adf3353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f29b9269897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f29ba542c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f29ba547a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f29ba548dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f2a05fe1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f2a0b028609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f2a0adf3353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f29b9269897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f29ba1cc119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f2a05fe1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f2a0b028609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f2a0adf3353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default4]:[rank36]: Traceback (most recent call last): [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank36]: trainer.train(dataloader) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank36]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank36]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank36]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank36]: grad_accumulator.backward(sum(activations)) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank36]: result = loss.backward() [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank36]: torch.autograd.backward( [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank36]: _engine_run_backward( [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank36]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank36]: return user_fn(self, *args) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank36]: pipeline_state.run_communication() [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default4]:[rank36]: self.grads_buffer.append(recv_grad()) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default4]:[rank36]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank36]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank36]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank36]: dist.recv( [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank36]: return func(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank36]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank36]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank37]: Traceback (most recent call last): [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank37]: trainer.train(dataloader) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank37]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank37]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank37]: grad_accumulator.backward(sum(activations)) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank37]: result = loss.backward() [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank37]: torch.autograd.backward( [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank37]: _engine_run_backward( [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank37]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank37]: return user_fn(self, *args) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank37]: pipeline_state.run_communication() [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default5]:[rank37]: self.grads_buffer.append(recv_grad()) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default5]:[rank37]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank37]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank37]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank37]: dist.recv( [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank37]: return func(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank37]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank37]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default6]:[rank38]: Traceback (most recent call last): [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank38]: trainer.train(dataloader) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank38]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank38]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank38]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default6]:[rank38]: grad_accumulator.backward(sum(activations)) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default6]:[rank38]: result = loss.backward() [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default6]:[rank38]: torch.autograd.backward( [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default6]:[rank38]: _engine_run_backward( [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default6]:[rank38]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default6]:[rank38]: return user_fn(self, *args) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default6]:[rank38]: pipeline_state.run_communication() [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default6]:[rank38]: self.grads_buffer.append(recv_grad()) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default6]:[rank38]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank38]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank38]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank38]: dist.recv( [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank38]: return func(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank38]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank38]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank32]: Traceback (most recent call last): [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank32]: trainer.train(dataloader) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank32]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank32]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank32]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank32]: grad_accumulator.backward(sum(activations)) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank32]: result = loss.backward() [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank32]: torch.autograd.backward( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank32]: _engine_run_backward( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank32]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank32]: return user_fn(self, *args) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank32]: pipeline_state.run_communication() [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default0]:[rank32]: self.grads_buffer.append(recv_grad()) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default0]:[rank32]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank32]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank32]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank32]: dist.recv( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank32]: return func(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank32]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank32]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default3]:[rank35]: Traceback (most recent call last): [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank35]: trainer.train(dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank35]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank35]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank35]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank35]: grad_accumulator.backward(sum(activations)) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank35]: result = loss.backward() [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank35]: torch.autograd.backward( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank35]: _engine_run_backward( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank35]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank35]: return user_fn(self, *args) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank35]: pipeline_state.run_communication() [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default3]:[rank35]: self.grads_buffer.append(recv_grad()) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default3]:[rank35]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank35]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank35]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank35]: dist.recv( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank35]: return func(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank35]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank35]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank33]: Traceback (most recent call last): [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank33]: trainer.train(dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank33]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank33]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank33]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank33]: grad_accumulator.backward(sum(activations)) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank33]: result = loss.backward() [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank33]: torch.autograd.backward( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank33]: _engine_run_backward( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank33]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank33]: return user_fn(self, *args) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank33]: pipeline_state.run_communication() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default1]:[rank33]: self.grads_buffer.append(recv_grad()) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default1]:[rank33]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank33]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank33]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]:[rank33]: dist.recv( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank33]: return func(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank33]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank33]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank25]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b4eadb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2b4fdb4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2b4fdb9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2b4fdbadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f2b9b853e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f2ba089a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f2ba0665353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b4eadb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2b4fdb4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2b4fdb9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2b4fdbadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f2b9b853e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f2ba089a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f2ba0665353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b4eadb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f2b4fa3e119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f2b9b853e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f2ba089a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f2ba0665353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:[rank49]: Traceback (most recent call last): [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank49]: trainer.train(dataloader) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank49]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank49]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank49]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank49]: output = model(**micro_batch) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank49]: sharded_logits = self.model( [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank49]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank49]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank49]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank49]: pipeline_state.run_communication() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank49]: recv_activation_tensor = recv_activation() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank49]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank49]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank49]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]:[rank49]: dist.recv( [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank49]: return func(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank49]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default2]:[rank42]: Traceback (most recent call last): [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank42]: trainer.train(dataloader) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank42]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank42]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank42]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank42]: grad_accumulator.backward(sum(activations)) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank42]: result = loss.backward() [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank42]: torch.autograd.backward( [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank42]: _engine_run_backward( [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank42]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank42]: return user_fn(self, *args) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank42]: pipeline_state.run_communication() [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default2]:[rank42]: self.grads_buffer.append(recv_grad()) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default2]:[rank42]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank42]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank42]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default2]:[rank42]: dist.recv( [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank42]: return func(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank42]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank42]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank24]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default0]:[rank24]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank24]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank24]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97c3e4b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f97c5124c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f97c5129a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f97c512adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f9810bc3e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f9815c0a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f98159d5353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97c3e4b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f97c5124c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f97c5129a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f97c512adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f9810bc3e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f9815c0a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f98159d5353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97c3e4b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f97c4dae119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f9810bc3e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f9815c0a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f98159d5353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default5]:[rank61]: Traceback (most recent call last): [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank61]: trainer.train(dataloader) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank61]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank61]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank61]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank61]: output = model(**micro_batch) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank61]: sharded_logits = self.model( [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank61]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank61]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank61]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank61]: pipeline_state.run_communication() [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank61]: recv_activation_tensor = recv_activation() [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank61]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank61]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank61]: dist.recv( [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank61]: return func(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank61]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank61]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default3]:[rank51]: Traceback (most recent call last): [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank51]: trainer.train(dataloader) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank51]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank51]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank51]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank51]: output = model(**micro_batch) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank51]: sharded_logits = self.model( [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank51]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank51]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank51]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank51]: pipeline_state.run_communication() [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank51]: recv_activation_tensor = recv_activation() [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank51]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank51]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank51]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank51]: dist.recv( [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank51]: return func(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank51]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank51]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default2]:[rank50]: Traceback (most recent call last): [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank50]: trainer.train(dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank50]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank50]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank50]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank50]: output = model(**micro_batch) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank50]: sharded_logits = self.model( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank50]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank50]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank50]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank50]: pipeline_state.run_communication() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank50]: recv_activation_tensor = recv_activation() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank50]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank50]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank50]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default2]:[rank50]: dist.recv( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank50]: return func(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank50]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank50]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fce57a53897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fce58d2cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fce58d31a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fce58d32dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7fcea47cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7fcea9812609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7fcea95dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fce57a53897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fce58d2cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fce58d31a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fce58d32dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7fcea47cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7fcea9812609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7fcea95dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fce57a53897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7fce589b6119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7fcea47cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7fcea9812609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7fcea95dd353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default6]:[rank54]: Traceback (most recent call last): [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank54]: trainer.train(dataloader) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank54]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank54]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank54]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank54]: output = model(**micro_batch) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return forward_call(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank54]: sharded_logits = self.model( [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return forward_call(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank54]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank54]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return forward_call(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank54]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank54]: pipeline_state.run_communication() [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank54]: recv_activation_tensor = recv_activation() [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank54]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank54]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank54]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank54]: dist.recv( [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank54]: return func(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank54]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank54]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank46]: Traceback (most recent call last): [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank46]: trainer.train(dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank46]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank46]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank46]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default6]:[rank46]: grad_accumulator.backward(sum(activations)) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default6]:[rank46]: result = loss.backward() [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default6]:[rank46]: torch.autograd.backward( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default6]:[rank46]: _engine_run_backward( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default6]:[rank46]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default6]:[rank46]: return user_fn(self, *args) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default6]:[rank46]: pipeline_state.run_communication() [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default6]:[rank46]: self.grads_buffer.append(recv_grad()) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default6]:[rank46]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank46]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank46]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank46]: dist.recv( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank46]: return func(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank46]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank46]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a3c1eb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5a3d4c4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5a3d4c9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5a3d4cadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f5a88f63e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f5a8dfaa609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f5a8dd75353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a3c1eb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5a3d4c4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5a3d4c9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5a3d4cadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank22]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default2]:frame #4: + 0xd3e95 (0x7f5a88f63e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:[rank22]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank22]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank22]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [default2]:frame #5: + 0x8609 (0x7f5a8dfaa609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f5a8dd75353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd6e4099897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd6e5372c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd6e5377a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd6e5378dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7fd730e11e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fd735e58609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fd735c23353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a3c1eb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f5a3d14e119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #2: + 0xd3e95 (0x7f5a88f63e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f5a8dfaa609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd6e4099897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd6e5372c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd6e5377a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd6e5378dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: clone + 0x43 (0x7f5a8dd75353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default6]:frame #4: + 0xd3e95 (0x7fd730e11e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fd735e58609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fd735c23353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd6e4099897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7fd6e4ffc119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7fd730e11e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7fd735e58609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7fd735c23353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default7]:[rank63]: Traceback (most recent call last): [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank63]: trainer.train(dataloader) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank63]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank63]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank63]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank63]: output = model(**micro_batch) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: sharded_logits = self.model( [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank63]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank63]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank63]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank63]: pipeline_state.run_communication() [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank63]: recv_activation_tensor = recv_activation() [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank63]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank63]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank63]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank63]: dist.recv( [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank63]: return func(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank63]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank63]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default4]:[rank52]: Traceback (most recent call last): [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank52]: trainer.train(dataloader) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank52]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank52]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank52]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank52]: output = model(**micro_batch) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank52]: sharded_logits = self.model( [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank52]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank52]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank52]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank52]: pipeline_state.run_communication() [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank52]: recv_activation_tensor = recv_activation() [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank52]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank52]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank52]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank52]: dist.recv( [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank52]: return func(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank52]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank52]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank62]: Traceback (most recent call last): [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank62]: trainer.train(dataloader) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank62]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank62]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank62]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank62]: output = model(**micro_batch) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: return forward_call(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank62]: sharded_logits = self.model( [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: return forward_call(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank62]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank62]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: return forward_call(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank62]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank62]: pipeline_state.run_communication() [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank62]: recv_activation_tensor = recv_activation() [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank62]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank62]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank62]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank62]: dist.recv( [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank62]: return func(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank62]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank62]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default2]:[rank58]: Traceback (most recent call last): [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank58]: trainer.train(dataloader) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank58]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank58]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank58]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank58]: output = model(**micro_batch) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank58]: sharded_logits = self.model( [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank58]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank58]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank58]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank58]: pipeline_state.run_communication() [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank58]: recv_activation_tensor = recv_activation() [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank58]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank58]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank58]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default2]:[rank58]: dist.recv( [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: return func(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank58]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank58]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3feb215897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3fec4eec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3fec4f3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3fec4f4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f4037f8de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f403cfd4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f403cd9f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3feb215897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3fec4eec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3fec4f3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3fec4f4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f4037f8de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f403cfd4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f403cd9f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3feb215897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f3fec178119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f4037f8de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f403cfd4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f403cd9f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default7]:[rank39]: Traceback (most recent call last): [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank39]: trainer.train(dataloader) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank39]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank39]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank39]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank39]: grad_accumulator.backward(sum(activations)) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank39]: result = loss.backward() [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank39]: torch.autograd.backward( [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]:[rank39]: _engine_run_backward( [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank39]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:[rank39]: return user_fn(self, *args) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank39]: pipeline_state.run_communication() [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default7]:[rank39]: self.grads_buffer.append(recv_grad()) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default7]:[rank39]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank39]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank39]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank39]: dist.recv( [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank39]: return func(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank39]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank39]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank45]: Traceback (most recent call last): [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank45]: trainer.train(dataloader) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank45]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank45]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank45]: grad_accumulator.backward(sum(activations)) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank45]: result = loss.backward() [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank45]: torch.autograd.backward( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank45]: _engine_run_backward( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank45]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank45]: return user_fn(self, *args) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank45]: pipeline_state.run_communication() [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default5]:[rank45]: self.grads_buffer.append(recv_grad()) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default5]:[rank45]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank45]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank45]: dist.recv( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank45]: return func(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank45]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank45]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank44]: Traceback (most recent call last): [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank44]: trainer.train(dataloader) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank44]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank44]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank44]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank44]: grad_accumulator.backward(sum(activations)) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank44]: result = loss.backward() [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank44]: torch.autograd.backward( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank44]: _engine_run_backward( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank44]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank44]: return user_fn(self, *args) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank44]: pipeline_state.run_communication() [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default4]:[rank44]: self.grads_buffer.append(recv_grad()) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default4]:[rank44]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank44]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank44]: dist.recv( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank44]: return func(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank44]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank44]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank34]: Traceback (most recent call last): [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: trainer.train(dataloader) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank34]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank34]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank34]: grad_accumulator.backward(sum(activations)) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank34]: result = loss.backward() [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank34]: torch.autograd.backward( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank34]: _engine_run_backward( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank34]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank34]: return user_fn(self, *args) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank34]: pipeline_state.run_communication() [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default2]:[rank34]: self.grads_buffer.append(recv_grad()) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default2]:[rank34]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank34]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank34]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default2]:[rank34]: dist.recv( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank34]: return func(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank34]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank34]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b4b631897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2b4c90ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2b4c90fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2b4c910dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f2b983a9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f2b9d3f0609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f2b9d1bb353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b4b631897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2b4c90ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2b4c90fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2b4c910dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f2b983a9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f2b9d3f0609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f2b9d1bb353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b4b631897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f2b4c594119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f2b983a9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7f2b9d3f0609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f2b9d1bb353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default3]:[rank43]: Traceback (most recent call last): [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank43]: trainer.train(dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank43]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank43]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank43]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank43]: grad_accumulator.backward(sum(activations)) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank43]: result = loss.backward() [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank43]: torch.autograd.backward( [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank43]: _engine_run_backward( [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank43]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank43]: return user_fn(self, *args) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank43]: pipeline_state.run_communication() [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default3]:[rank43]: self.grads_buffer.append(recv_grad()) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default3]:[rank43]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank43]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank43]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank43]: dist.recv( [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank43]: return func(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank43]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank43]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa8ac27a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa8ad553c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa8ad558a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa8ad559dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fa8f8ff2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fa8fe039609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fa8fde04353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa8ac27a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa8ad553c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa8ad558a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa8ad559dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fa8f8ff2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fa8fe039609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fa8fde04353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa8ac27a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7fa8ad1dd119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7fa8f8ff2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7fa8fe039609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7fa8fde04353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default0]:[rank40]: Traceback (most recent call last): [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank40]: trainer.train(dataloader) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank40]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank40]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank40]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank40]: grad_accumulator.backward(sum(activations)) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank40]: result = loss.backward() [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank40]: torch.autograd.backward( [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank40]: _engine_run_backward( [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank40]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank40]: return user_fn(self, *args) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank40]: pipeline_state.run_communication() [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default0]:[rank40]: self.grads_buffer.append(recv_grad()) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default0]:[rank40]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank40]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank40]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank40]: dist.recv( [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank40]: return func(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f106fc24897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1070efdc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1070f02a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1070f03dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f10bc99ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f10c19e3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f10c17ae353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f106fc24897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1070efdc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1070f02a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1070f03dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f10bc99ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f10c19e3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f10c17ae353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f106fc24897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f1070b87119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f10bc99ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f10c19e3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f10c17ae353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default6]:[rank30]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank30]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb255f4e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb257227c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb25722ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb25722ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7fb2a2cc6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fb2a7d0d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fb2a7ad8353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb255f4e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb257227c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb25722ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb25722ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7fb2a2cc6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fb2a7d0d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fb2a7ad8353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb255f4e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7fb256eb1119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7fb2a2cc6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7fb2a7d0d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7fb2a7ad8353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default2]:[rank26]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default2]:[rank26]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank26]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank26]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f028951f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f028a7f8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f028a7fda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f028a7fedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f02d6297e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f02db2de609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f02db0a9353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f028951f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f028a7f8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f028a7fda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f028a7fedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f02d6297e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f02db2de609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f02db0a9353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f028951f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f028a482119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f02d6297e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f02db2de609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f02db0a9353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default4]:[rank28]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 1] Timeout at NCCL work: 34, last enqueued NCCL work: 35, last completed NCCL work: 33. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank28]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd59e107897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd59f3e0c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd59f3e5a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd59f3e6dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fd5eae7fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fd5efec6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fd5efc91353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=34, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd59e107897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd59f3e0c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd59f3e5a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd59f3e6dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fd5eae7fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fd5efec6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fd5efc91353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd59e107897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7fd59f06a119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7fd5eae7fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7fd5efec6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7fd5efc91353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default5]:[rank61]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efed3f1d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efed51f6c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efed51fba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efed51fcdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7eff20c95e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7eff25cdc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7eff25aa7353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efed3f1d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efed51f6c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efed51fba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efed51fcdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7eff20c95e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7eff25cdc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7eff25aa7353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efed3f1d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7efed4e80119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7eff20c95e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7eff25cdc609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7eff25aa7353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default3]:[rank59]: Traceback (most recent call last): [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank59]: trainer.train(dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank59]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank59]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank59]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: output = model(**micro_batch) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank59]: sharded_logits = self.model( [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank59]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank59]: pipeline_state.run_communication() [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank59]: recv_activation_tensor = recv_activation() [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank59]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank59]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank59]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank59]: dist.recv( [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank59]: return func(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank59]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank59]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default7]:[rank47]: Traceback (most recent call last): [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank47]: trainer.train(dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank47]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank47]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank47]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank47]: grad_accumulator.backward(sum(activations)) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank47]: result = loss.backward() [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank47]: torch.autograd.backward( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]:[rank47]: _engine_run_backward( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank47]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:[rank47]: return user_fn(self, *args) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank47]: pipeline_state.run_communication() [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default7]:[rank47]: self.grads_buffer.append(recv_grad()) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default7]:[rank47]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank47]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank47]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank47]: dist.recv( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank47]: return func(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank47]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank47]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank56]: Traceback (most recent call last): [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank56]: trainer.train(dataloader) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank56]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank56]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank56]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank56]: output = model(**micro_batch) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank56]: sharded_logits = self.model( [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank56]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank56]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank56]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank56]: pipeline_state.run_communication() [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank56]: recv_activation_tensor = recv_activation() [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank56]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank56]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank56]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank56]: dist.recv( [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank56]: return func(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank56]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank56]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default5]:[rank53]: Traceback (most recent call last): [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank53]: trainer.train(dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank53]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank53]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank53]: output = model(**micro_batch) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank53]: sharded_logits = self.model( [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default7]:[rank55]: Traceback (most recent call last): [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank55]: trainer.train(dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank55]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank55]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank55]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank53]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank53]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank53]: pipeline_state.run_communication() [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank55]: output = model(**micro_batch) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: return forward_call(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank55]: sharded_logits = self.model( [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: return forward_call(*args, **kwargs) [default5]:[rank53]: recv_activation_tensor = recv_activation() [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank53]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank53]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank53]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank53]: dist.recv( [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank53]: return func(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank53]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank55]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank53]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank55]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: return forward_call(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank55]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank55]: pipeline_state.run_communication() [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank55]: recv_activation_tensor = recv_activation() [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank55]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank55]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank55]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank55]: dist.recv( [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank55]: return func(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank55]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank55]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default1]:[rank41]: Traceback (most recent call last): [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: trainer.train(dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank41]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank41]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank41]: grad_accumulator.backward(sum(activations)) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank41]: result = loss.backward() [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank41]: torch.autograd.backward( [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank41]: _engine_run_backward( [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank41]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank41]: return user_fn(self, *args) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank41]: pipeline_state.run_communication() [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default1]:[rank41]: self.grads_buffer.append(recv_grad()) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default1]:[rank41]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank41]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank41]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]:[rank41]: dist.recv( [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank41]: return func(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank41]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank41]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default4]:[rank60]: Traceback (most recent call last): [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank60]: trainer.train(dataloader) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank60]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank60]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank60]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank60]: output = model(**micro_batch) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank60]: sharded_logits = self.model( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank60]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank60]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank60]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank60]: pipeline_state.run_communication() [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank60]: recv_activation_tensor = recv_activation() [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank60]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank60]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank60]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank60]: dist.recv( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank60]: return func(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank60]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank60]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default0]:[rank48]: Traceback (most recent call last): [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank48]: trainer.train(dataloader) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank48]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank48]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank48]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank48]: output = model(**micro_batch) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank48]: sharded_logits = self.model( [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank48]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank48]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank48]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank48]: pipeline_state.run_communication() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank48]: recv_activation_tensor = recv_activation() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank48]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank48]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank48]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank48]: dist.recv( [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank48]: return func(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank48]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank48]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f12b5e09897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f12b70e2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f12b70e7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f12b70e8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f1302b81e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f1307bc8609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f1307993353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f12b5e09897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f12b70e2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f12b70e7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f12b70e8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f1302b81e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f1307bc8609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f1307993353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f12b5e09897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f12b6d6c119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f1302b81e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f1307bc8609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f1307993353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default6]:[rank62]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank62]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3fd5b47897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3fd6e20c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3fd6e25a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3fd6e26dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f40228bfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f4027906609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f40276d1353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3fd5b47897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3fd6e20c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3fd6e25a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3fd6e26dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f40228bfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f4027906609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f40276d1353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3fd5b47897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f3fd6aaa119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f40228bfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f4027906609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f40276d1353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default3]:[rank51]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2d8cd62897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2d8e03bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2d8e040a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2d8e041dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f2dd9adae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f2ddeb21609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f2dde8ec353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2d8cd62897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2d8e03bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2d8e040a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2d8e041dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f2dd9adae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f2ddeb21609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f2dde8ec353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2d8cd62897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f2d8dcc5119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f2dd9adae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f2ddeb21609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f2dde8ec353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default1]:[rank57]: Traceback (most recent call last): [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: trainer.train(dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank57]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank57]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank57]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank57]: output = model(**micro_batch) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank57]: sharded_logits = self.model( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank57]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank57]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank57]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank57]: pipeline_state.run_communication() [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank57]: recv_activation_tensor = recv_activation() [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank57]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank57]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank57]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]:[rank57]: dist.recv( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank57]: return func(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank57]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank57]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5594abb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5595d94c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5595d99a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5595d9adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f55e1833e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f55e687a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f55e6645353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5594abb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5595d94c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5595d99a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5595d9adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f55e1833e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f55e687a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f55e6645353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5594abb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f5595a1e119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f55e1833e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f55e687a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f55e6645353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default6]:[rank54]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84298ce897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f842aba7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f842abaca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f842abaddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f8476646e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f847b68d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f847b458353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84298ce897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f842aba7c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f842abaca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f842abaddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f8476646e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f847b68d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f847b458353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84298ce897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f842a831119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f8476646e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f847b68d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f847b458353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default2]:[rank50]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600040 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f83a1eef897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f83a31c8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f83a31cda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f83a31cedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f83eec67e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f83f3cae609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f83f3a79353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600040 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f83a1eef897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f83a31c8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f83a31cda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f83a31cedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f83eec67e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f83f3cae609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f83f3a79353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f83a1eef897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f83a2e52119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f83eec67e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f83f3cae609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f83f3a79353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default4]:[rank52]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318c49b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f318d774c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f318d779a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f318d77adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f31d9213e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f31de25a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f31de025353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318c49b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f318d774c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f318d779a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f318d77adcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f31d9213e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f31de25a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f31de025353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318c49b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f318d3fe119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f31d9213e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7f31de25a609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f31de025353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f77b846c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f77b9745c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f77b974aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f77b974bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f78051e4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f780a22b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f7809ff6353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f77b846c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f77b9745c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f77b974aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f77b974bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f78051e4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f780a22b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f7809ff6353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f77b846c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f77b93cf119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f78051e4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f780a22b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f7809ff6353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default3]:[rank59]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f395b628897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f395c901c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f395c906a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f395c907dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f39a83a0e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f39ad3e7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f39ad1b2353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f395b628897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f395c901c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f395c906a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f395c907dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f39a83a0e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f39ad3e7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f39ad1b2353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f395b628897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f395c58b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f39a83a0e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f39ad3e7609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f39ad1b2353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default5]:[rank53]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600031 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f41496a5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f414a97ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f414a983a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f414a984dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f419641de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f419b464609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f419b22f353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600031 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f41496a5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f414a97ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f414a983a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f414a984dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f419641de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f419b464609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f419b22f353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f41496a5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f414a608119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f419641de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f419b464609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f419b22f353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default7]:[rank55]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1047874897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1048b4dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1048b52a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1048b53dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f10945ece95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f1099633609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f10993fe353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1047874897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1048b4dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1048b52a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1048b53dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f10945ece95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f1099633609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f10993fe353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1047874897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f10487d7119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f10945ece95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f1099633609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f10993fe353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default0]:[rank48]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef12f79897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fef14252c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fef14257a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fef14258dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7fef5fcf1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7fef64d38609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7fef64b03353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef12f79897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fef14252c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fef14257a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fef14258dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7fef5fcf1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7fef64d38609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7fef64b03353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef12f79897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7fef13edc119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7fef5fcf1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7fef64d38609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7fef64b03353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default2]:[rank42]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default2]:[rank42]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank42]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank42]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600005 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb870da5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb87207ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb872083a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb872084dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fb8bdb1de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fb8c2b64609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fb8c292f353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600005 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb870da5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb87207ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb872083a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb872084dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fb8bdb1de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fb8c2b64609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fb8c292f353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb870da5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7fb871d08119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7fb8bdb1de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7fb8c2b64609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7fb8c292f353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default4]:[rank60]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f416360b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f41648e4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f41648e9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41648eadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f41b0383e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f41b53ca609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f41b5195353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f416360b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f41648e4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f41648e9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41648eadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f41b0383e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f41b53ca609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f41b5195353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f416360b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f416456e119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f41b0383e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7f41b53ca609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f41b5195353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48012a4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f480257dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4802582a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4802583dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f484e01ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f4853063609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f4852e2e353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48012a4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f480257dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4802582a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4802583dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f484e01ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f4853063609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f4852e2e353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48012a4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f4802207119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f484e01ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f4853063609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f4852e2e353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default4]:[rank36]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default4]:[rank36]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank36]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank36]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff7fa6e6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff7fb9bfc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff7fb9c4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff7fb9c5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7ff84745ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7ff84c4a5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7ff84c270353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff7fa6e6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7ff7fb9bfc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff7fb9c4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff7fb9c5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7ff84745ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7ff84c4a5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7ff84c270353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff7fa6e6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7ff7fb649119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7ff84745ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7ff84c4a5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7ff84c270353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default1]:[rank57]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 14, last enqueued NCCL work: 15, last completed NCCL work: 13. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6355372897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f635664bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6356650a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6356651dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f63a20eae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f63a7131609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f63a6efc353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=14, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6355372897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f635664bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6356650a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6356651dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f63a20eae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f63a7131609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f63a6efc353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6355372897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f63562d5119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f63a20eae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f63a7131609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f63a6efc353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default5]:[rank37]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default5]:[rank37]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank37]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank37]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600004 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48bee51897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f48c012ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f48c012fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f48c0130dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f490bbc9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f4910c10609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f49109db353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600004 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48bee51897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f48c012ac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f48c012fa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f48c0130dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f490bbc9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f4910c10609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f49109db353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48bee51897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f48bfdb4119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f490bbc9e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f4910c10609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f49109db353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default0]:[rank32]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3c83617897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3c848f0c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3c848f5a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3c848f6dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f3cd038fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f3cd53d6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f3cd51a1353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3c83617897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3c848f0c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3c848f5a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3c848f6dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f3cd038fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f3cd53d6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f3cd51a1353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3c83617897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f3c8457a119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f3cd038fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f3cd53d6609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f3cd51a1353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default6]:[rank38]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default6]:[rank38]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank38]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank38]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f18dab64897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f18dbe3dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f18dbe42a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f18dbe43dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f19278dce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f192c923609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f192c6ee353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f18dab64897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f18dbe3dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f18dbe42a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f18dbe43dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f19278dce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f192c923609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f192c6ee353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f18dab64897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f18dbac7119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f19278dce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7f192c923609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f192c6ee353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default3]:[rank35]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default3]:[rank35]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank35]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank35]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600012 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1aec3b0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1aed689c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1aed68ea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1aed68fdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f1b39128e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f1b3e16f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f1b3df3a353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600012 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1aec3b0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1aed689c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1aed68ea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1aed68fdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f1b39128e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f1b3e16f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f1b3df3a353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1aec3b0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f1aed313119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f1b39128e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f1b3e16f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f1b3df3a353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default1]:[rank33]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default1]:[rank33]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank33]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank33]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3aebe47897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3aed120c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3aed125a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3aed126dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f3b38bbfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f3b3dc06609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f3b3d9d1353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3aebe47897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3aed120c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3aed125a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3aed126dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f3b38bbfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f3b3dc06609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f3b3d9d1353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3aebe47897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f3aecdaa119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f3b38bbfe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f3b3dc06609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f3b3d9d1353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default7]:[rank39]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default7]:[rank39]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank39]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank39]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7face9cda897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7faceafb3c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7faceafb8a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7faceafb9dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fad36a52e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fad3ba99609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fad3b864353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7face9cda897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7faceafb3c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7faceafb8a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7faceafb9dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fad36a52e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fad3ba99609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fad3b864353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7face9cda897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7faceac3d119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7fad36a52e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7fad3ba99609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7fad3b864353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default2]:[rank34]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default2]:[rank34]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank34]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank34]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84f5d7e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f84f7057c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f84f705ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84f705ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f8542af6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f8547b3d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f8547908353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84f5d7e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f84f7057c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f84f705ca80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84f705ddcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f8542af6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f8547b3d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f8547908353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84f5d7e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f84f6ce1119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f8542af6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 (0x7f8547b3d609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f8547908353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efee67c4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efee7a9dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efee7aa2a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efee7aa3dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7eff3353ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7eff38583609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7eff3834e353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efee67c4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efee7a9dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efee7aa2a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efee7aa3dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7eff3353ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7eff38583609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7eff3834e353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efee67c4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7efee7727119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7eff3353ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7eff38583609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7eff3834e353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600016 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a702d0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5a715a9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5a715aea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5a715afdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f5abd048e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f5ac208f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f5ac1e5a353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600016 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a702d0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f5a715a9c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5a715aea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5a715afdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f5abd048e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f5ac208f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f5ac1e5a353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a702d0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f5a71233119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f5abd048e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7f5ac208f609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f5ac1e5a353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default1]:[rank41]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default1]:[rank41]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank41]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank41]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600012 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9eecc40897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9eedf19c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9eedf1ea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9eedf1fdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f9f399b8e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f9f3e9ff609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f9f3e7ca353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600012 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9eecc40897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f9eedf19c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9eedf1ea80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9eedf1fdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f9f399b8e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f9f3e9ff609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f9f3e7ca353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9eecc40897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f9eedba3119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f9f399b8e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f9f3e9ff609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f9f3e7ca353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default7]:[rank47]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank47]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600025 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc55a255897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc55b52ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc55b533a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc55b534dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fc5a6fcde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fc5ac014609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fc5abddf353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600025 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc55a255897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc55b52ec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc55b533a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc55b534dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fc5a6fcde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fc5ac014609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fc5abddf353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc55a255897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7fc55b1b8119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7fc5a6fcde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7fc5ac014609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7fc5abddf353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default0]:[rank40]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efc603d6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efc616afc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efc616b4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efc616b5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7efcad14ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7efcb2195609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7efcb1f60353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efc603d6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efc616afc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7efc616b4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efc616b5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7efcad14ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7efcb2195609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7efcb1f60353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efc603d6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7efc61339119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7efcad14ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7efcb2195609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7efcb1f60353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default3]:[rank43]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 26, last enqueued NCCL work: 27, last completed NCCL work: 25. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0dcd7ff897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0dcead8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0dceadda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0dceadedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f0e1a577e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f0e1f5be609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f0e1f389353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26, OpType=SEND, NumelIn=16384, NumelOut=16384, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0dcd7ff897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0dcead8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0dceadda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0dceadedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f0e1a577e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f0e1f5be609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f0e1f389353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0dcd7ff897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f0dce762119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f0e1a577e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f0e1f5be609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f0e1f389353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: E0702 23:37:39.999000 139704473855808 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 1754111) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 1 (local_rank: 1) exitcode : -6 (pid: 1754112) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754112 [2]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 2 (local_rank: 2) exitcode : -6 (pid: 1754113) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754113 [3]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 3 (local_rank: 3) exitcode : -6 (pid: 1754114) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754114 [4]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 4 (local_rank: 4) exitcode : -6 (pid: 1754115) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754115 [5]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 5 (local_rank: 5) exitcode : -6 (pid: 1754116) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754116 [6]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 6 (local_rank: 6) exitcode : -6 (pid: 1754117) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754117 [7]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 7 (local_rank: 7) exitcode : -6 (pid: 1754118) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754118 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:39 host : ip-26-0-160-225.ec2.internal rank : 0 (local_rank: 0) exitcode : -6 (pid: 1754111) error_file: traceback : Signal 6 (SIGABRT) received by PID 1754111 ============================================================ srun: error: ip-26-0-160-225: task 0: Exited with exit code 1 W0702 23:37:43.816000 140352219457280 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-78.ec2.internal_1119276_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:44.173000 139845022799616 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-62.ec2.internal_3870227_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:44.394000 139662658348800 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-103.ec2.internal_846371_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:44.497000 140574592993024 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-88.ec2.internal_859579_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:44.519000 139849447196416 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-153.ec2.internal_1398027_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:44.548000 140415392495360 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-162-233.ec2.internal_1376764_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:44.623000 140576767854336 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-102.ec2.internal_3742111_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:44.843000 139850683533120 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3870301 closing signal SIGTERM W0702 23:37:44.844000 139850683533120 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3870303 closing signal SIGTERM W0702 23:37:44.844000 139850683533120 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3870306 closing signal SIGTERM W0702 23:37:44.844000 139850683533120 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3870308 closing signal SIGTERM W0702 23:37:44.880000 140582428587840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3742186 closing signal SIGTERM W0702 23:37:44.880000 140582428587840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3742187 closing signal SIGTERM W0702 23:37:44.880000 140582428587840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3742189 closing signal SIGTERM W0702 23:37:44.880000 140582428587840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3742190 closing signal SIGTERM W0702 23:37:44.880000 140582428587840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3742193 closing signal SIGTERM W0702 23:37:44.913000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1376837 closing signal SIGTERM W0702 23:37:44.914000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1376838 closing signal SIGTERM W0702 23:37:44.914000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1376839 closing signal SIGTERM W0702 23:37:44.914000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1376840 closing signal SIGTERM W0702 23:37:44.914000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1376842 closing signal SIGTERM W0702 23:37:44.914000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1376843 closing signal SIGTERM W0702 23:37:44.914000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1376844 closing signal SIGTERM E0702 23:37:45.008000 140580253726528 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 859653) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 E0702 23:37:45.008000 139668319082304 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 846445) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0702 23:37:45.020000 139668319082304 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-103.ec2.internal_846371_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:45.021000 140580253726528 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_859579_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:45.049000 139668319082304 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-103.ec2.internal_846371_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:45.058000 140580253726528 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_859579_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:45.085000 140580253726528 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_859579_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in W0702 23:37:45.084000 139668319082304 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-103.ec2.internal_846371_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 57 (local_rank: 1) exitcode : -6 (pid: 859654) error_file: traceback : Signal 6 (SIGABRT) received by PID 859654 [2]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 58 (local_rank: 2) exitcode : -6 (pid: 859655) error_file: traceback : Signal 6 (SIGABRT) received by PID 859655 [3]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 59 (local_rank: 3) exitcode : -6 (pid: 859656) error_file: traceback : Signal 6 (SIGABRT) received by PID 859656 [4]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 60 (local_rank: 4) exitcode : -6 (pid: 859657) error_file: traceback : Signal 6 (SIGABRT) received by PID 859657 [5]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 61 (local_rank: 5) exitcode : -6 (pid: 859658) error_file: traceback : Signal 6 (SIGABRT) received by PID 859658 [6]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 62 (local_rank: 6) exitcode : -6 (pid: 859659) error_file: traceback : Signal 6 (SIGABRT) received by PID 859659 [7]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 63 (local_rank: 7) exitcode : -6 (pid: 859660) error_file: traceback : Signal 6 (SIGABRT) received by PID 859660 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:44 host : ip-26-0-171-88.ec2.internal rank : 56 (local_rank: 0) exitcode : -6 (pid: 859653) error_file: traceback : Signal 6 (SIGABRT) received by PID 859653 ============================================================ raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 9 (local_rank: 1) exitcode : -6 (pid: 846446) error_file: traceback : Signal 6 (SIGABRT) received by PID 846446 [2]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 10 (local_rank: 2) exitcode : -6 (pid: 846447) error_file: traceback : Signal 6 (SIGABRT) received by PID 846447 [3]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 11 (local_rank: 3) exitcode : -6 (pid: 846448) error_file: traceback : Signal 6 (SIGABRT) received by PID 846448 [4]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 12 (local_rank: 4) exitcode : -6 (pid: 846449) error_file: traceback : Signal 6 (SIGABRT) received by PID 846449 [5]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 13 (local_rank: 5) exitcode : -6 (pid: 846450) error_file: traceback : Signal 6 (SIGABRT) received by PID 846450 [6]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 14 (local_rank: 6) exitcode : -6 (pid: 846451) error_file: traceback : Signal 6 (SIGABRT) received by PID 846451 [7]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 15 (local_rank: 7) exitcode : -6 (pid: 846452) error_file: traceback : Signal 6 (SIGABRT) received by PID 846452 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:44 host : ip-26-0-161-103.ec2.internal rank : 8 (local_rank: 0) exitcode : -6 (pid: 846445) error_file: traceback : Signal 6 (SIGABRT) received by PID 846445 ============================================================ E0702 23:37:45.094000 140357880190784 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 1119350) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0702 23:37:45.107000 140357880190784 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-78.ec2.internal_1119276_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. E0702 23:37:45.110000 139855107929920 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 1398101) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0702 23:37:45.123000 139855107929920 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-153.ec2.internal_1398027_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:45.136000 140357880190784 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-78.ec2.internal_1119276_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:45.152000 139855107929920 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-153.ec2.internal_1398027_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:45.164000 140357880190784 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-78.ec2.internal_1119276_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 25 (local_rank: 1) exitcode : -6 (pid: 1119351) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119351 [2]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 26 (local_rank: 2) exitcode : -6 (pid: 1119352) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119352 [3]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 27 (local_rank: 3) exitcode : -6 (pid: 1119353) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119353 [4]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 28 (local_rank: 4) exitcode : -6 (pid: 1119354) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119354 [5]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 29 (local_rank: 5) exitcode : -6 (pid: 1119355) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119355 [6]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 30 (local_rank: 6) exitcode : -6 (pid: 1119356) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119356 [7]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 31 (local_rank: 7) exitcode : -6 (pid: 1119357) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119357 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:44 host : ip-26-0-161-78.ec2.internal rank : 24 (local_rank: 0) exitcode : -6 (pid: 1119350) error_file: traceback : Signal 6 (SIGABRT) received by PID 1119350 ============================================================ W0702 23:37:45.180000 139855107929920 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-153.ec2.internal_1398027_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 17 (local_rank: 1) exitcode : -6 (pid: 1398102) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398102 [2]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 18 (local_rank: 2) exitcode : -6 (pid: 1398103) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398103 [3]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 19 (local_rank: 3) exitcode : -6 (pid: 1398104) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398104 [4]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 20 (local_rank: 4) exitcode : -6 (pid: 1398105) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398105 [5]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 21 (local_rank: 5) exitcode : -6 (pid: 1398106) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398106 [6]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 22 (local_rank: 6) exitcode : -6 (pid: 1398107) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398107 [7]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 23 (local_rank: 7) exitcode : -6 (pid: 1398108) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398108 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:44 host : ip-26-0-161-153.ec2.internal rank : 16 (local_rank: 0) exitcode : -6 (pid: 1398101) error_file: traceback : Signal 6 (SIGABRT) received by PID 1398101 ============================================================ srun: error: ip-26-0-171-88: task 6: Exited with exit code 1 srun: error: ip-26-0-161-103: task 2: Exited with exit code 1 srun: error: ip-26-0-161-153: task 3: Exited with exit code 1 srun: error: ip-26-0-161-78: task 1: Exited with exit code 1 E0702 23:37:46.017000 139850683533120 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 3870302) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0702 23:37:46.029000 139850683533120 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3870227_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:46.059000 139850683533120 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3870227_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:46.077000 139850683533120 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3870227_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_23:37:44 host : ip-26-0-171-62.ec2.internal rank : 51 (local_rank: 3) exitcode : -6 (pid: 3870304) error_file: traceback : Signal 6 (SIGABRT) received by PID 3870304 [2]: time : 2024-07-02_23:37:44 host : ip-26-0-171-62.ec2.internal rank : 52 (local_rank: 4) exitcode : -6 (pid: 3870305) error_file: traceback : Signal 6 (SIGABRT) received by PID 3870305 [3]: time : 2024-07-02_23:37:44 host : ip-26-0-171-62.ec2.internal rank : 54 (local_rank: 6) exitcode : -6 (pid: 3870307) error_file: traceback : Signal 6 (SIGABRT) received by PID 3870307 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:44 host : ip-26-0-171-62.ec2.internal rank : 49 (local_rank: 1) exitcode : -6 (pid: 3870302) error_file: traceback : Signal 6 (SIGABRT) received by PID 3870302 ============================================================ E0702 23:37:46.491000 140421053228864 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 1376841) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0702 23:37:46.504000 140421053228864 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-162-233.ec2.internal_1376764_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:46.536000 140421053228864 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-162-233.ec2.internal_1376764_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:46.546000 140421053228864 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-162-233.ec2.internal_1376764_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:44 host : ip-26-0-162-233.ec2.internal rank : 36 (local_rank: 4) exitcode : -6 (pid: 1376841) error_file: traceback : Signal 6 (SIGABRT) received by PID 1376841 ============================================================ srun: error: ip-26-0-171-62: task 5: Exited with exit code 1 E0702 23:37:46.888000 140582428587840 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 2 (pid: 3742188) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0702 23:37:46.901000 140582428587840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-102.ec2.internal_3742111_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:46.935000 140582428587840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-102.ec2.internal_3742111_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 23:37:46.948000 140582428587840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-102.ec2.internal_3742111_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_23:37:44 host : ip-26-0-171-102.ec2.internal rank : 45 (local_rank: 5) exitcode : -6 (pid: 3742191) error_file: traceback : Signal 6 (SIGABRT) received by PID 3742191 [2]: time : 2024-07-02_23:37:44 host : ip-26-0-171-102.ec2.internal rank : 46 (local_rank: 6) exitcode : -6 (pid: 3742192) error_file: traceback : Signal 6 (SIGABRT) received by PID 3742192 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_23:37:44 host : ip-26-0-171-102.ec2.internal rank : 42 (local_rank: 2) exitcode : -6 (pid: 3742188) error_file: traceback : Signal 6 (SIGABRT) received by PID 3742188 ============================================================ srun: error: ip-26-0-162-233: task 4: Exited with exit code 1 srun: error: ip-26-0-171-102: task 7: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.