======================== START TIME: Tue Jul 2 21:57:31 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0702 21:57:34.097000 139716060084032 torch/distributed/run.py:757]
W0702 21:57:34.097000 139716060084032 torch/distributed/run.py:757] *****************************************
W0702 21:57:34.097000 139716060084032 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0702 21:57:34.097000 139716060084032 torch/distributed/run.py:757] *****************************************
(The same OMP_NUM_THREADS warning was emitted between 21:57:34.104 and 21:57:34.138 by the seven other torch.distributed launchers: 140206314170176, 139641247233856, 139702449211200, 140035368605504, 140270114883392, 139982293182272 and 140494472976192.)
[default0]:07/02/2024 21:57:53 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Vocab Size Padding] Padded vocab (size: 50257) with 3 dummy tokens (new size: 50260)
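The three dummy tokens bring the GPT-2 vocabulary (50257) up to 50260, which divides evenly by the tp=4 tensor-parallel degree used in this run, so the embedding matrix can be sharded across the four TP ranks without a remainder. A minimal sketch of that padding rule, assuming the target is simply the next multiple of tp * make_vocab_size_divisible_by (the exact nanotron implementation may differ):

# Hypothetical helper illustrating the padding logged above; not nanotron's actual code.
def pad_vocab_size(vocab_size: int, tp: int, make_vocab_size_divisible_by: int = 1) -> int:
    multiple = tp * make_vocab_size_divisible_by
    # Round up to the next multiple so each TP rank gets an equal slice of the embedding rows.
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab_size(50257, tp=4) == 50260   # 3 dummy tokens, as in the log line above
assert pad_vocab_size(50257, tp=4) % 4 == 0   # 12565 embedding rows per TP rank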
[default0]:07/02/2024 21:57:53 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Config:
    Config(
        general=GeneralArgs(
            project='bench_cluster',
            run='%date_%jobid',
            seed=42,
            step=None,
            consumed_train_samples=None,
            benchmark_csv_path=None,
            ignore_sanity_checks=True),
        parallelism=ParallelismArgs(
            dp=1,
            pp=16,
            tp=4,
            pp_engine=,
            tp_mode=,
            tp_linear_async_communication=False,
            expert_parallel_size=1),
        model=ModelArgs(
            model_config=LlamaConfig(
                bos_token_id=1,
                eos_token_id=2,
                hidden_act='silu',
                hidden_size=2048,
                initializer_range=0.02,
                intermediate_size=4096,
                is_llama_config=True,
                max_position_embeddings=4096,
                num_attention_heads=32,
                num_hidden_layers=24,
                num_key_value_heads=32,
                pad_token_id=None,
                pretraining_tp=1,
                rms_norm_eps=1e-05,
                rope_scaling=None,
                rope_theta=10000.0,
                tie_word_embeddings=True,
                use_cache=True,
                vocab_size=50260),
            init_method=RandomInit(std=0.025),
            dtype=torch.bfloat16,
            make_vocab_size_divisible_by=1,
            ddp_bucket_cap_mb=25),
        tokenizer=TokenizerArgs(
            tokenizer_name_or_path='openai-community/gpt2',
            tokenizer_revision=None,
            tokenizer_max_length=None),
        checkpoints=CheckpointsArgs(
            checkpoints_path=Path('/dev/null'),
            checkpoint_interval=100000,
            save_initial_state=False,
            resume_checkpoint_path=None,
            checkpoints_path_is_shared_file_system=False),
        logging=LoggingArgs(
            log_level='info',
            log_level_replica='info',
            iteration_step_info_interval=1),
        tokens=TokensArgs(
            sequence_length=4096,
            train_steps=20,
            micro_batch_size=2,
            batch_accumulation_per_replica=512,
            val_check_interval=-1,
            limit_val_batches=0,
            limit_test_batches=0),
        optimizer=OptimizerArgs(
            optimizer_factory=AdamWOptimizerArgs(
                adam_eps=1e-08,
                adam_beta1=0.9,
                adam_beta2=0.95,
                torch_adam_is_fused=True,
                name='adamW'),
            zero_stage=1,
            weight_decay=0.01,
            clip_grad=1.0,
            accumulate_grad_in_fp32=True,
            learning_rate_scheduler=LRSchedulerArgs(
                learning_rate=0.0001,
                lr_warmup_steps=1,
                lr_warmup_style='linear',
                lr_decay_style='linear',
                lr_decay_steps=19,
                lr_decay_starting_step=None,
                min_decay_lr=1e-05)),
        data_stages=[DatasetStageArgs(
            name='Training Stage',
            start_training_step=1,
            data=DataArgs(
                dataset=PretrainDatasetsArgs(
                    hf_dataset_or_datasets='roneneldan/TinyStories',
                    hf_dataset_splits='train',
                    hf_dataset_config_name=None,
                    dataset_processing_num_proc_per_process=64,
                    dataset_overwrite_cache=False,
                    text_column_name='text'),
                seed=42,
                num_loading_workers=0))],
        profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-4_pp-16_mbz-2')),
        lighteval=None)
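With lr_warmup_steps=1 and lr_decay_steps=19 the schedule above spans exactly the 20 train_steps of the run: one linear warmup step to learning_rate=1e-4, then a linear decay down to min_decay_lr=1e-5. A small sketch of that shape, assuming plain linear interpolation (nanotron's exact boundary handling may differ slightly):

# Hypothetical re-implementation of the logged LRSchedulerArgs, for illustration only.
def lr_at_step(step: int,
               learning_rate: float = 1e-4,
               lr_warmup_steps: int = 1,
               lr_decay_steps: int = 19,
               min_decay_lr: float = 1e-5) -> float:
    if step <= lr_warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return learning_rate * step / lr_warmup_steps
    # Linear decay from the peak down to min_decay_lr over lr_decay_steps.
    progress = min((step - lr_warmup_steps) / lr_decay_steps, 1.0)
    return learning_rate + (min_decay_lr - learning_rate) * progress

print([round(lr_at_step(s), 6) for s in (1, 2, 10, 20)])  # peak 1e-4 at step 1, 1e-5 by step 20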
[default0]:07/02/2024 21:57:53 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Model Config:
    LlamaConfig(
        bos_token_id=1,
        eos_token_id=2,
        hidden_act='silu',
        hidden_size=2048,
        initializer_range=0.02,
        intermediate_size=4096,
        is_llama_config=True,
        max_position_embeddings=4096,
        num_attention_heads=32,
        num_hidden_layers=24,
        num_key_value_heads=32,
        pad_token_id=None,
        pretraining_tp=1,
        rms_norm_eps=1e-05,
        rope_scaling=None,
        rope_theta=10000.0,
        tie_word_embeddings=True,
        use_cache=True,
        vocab_size=50260)
[default0]:07/02/2024 21:57:53 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Building model..
[default0]:07/02/2024 21:57:53 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Setting PP block ranks...
At 07/02/2024 21:58:09 each of the 64 ranks (16 pipeline stages x 4 tensor-parallel ranks) logged its local share of the model. All TP ranks within a pipeline stage report identical values, and every rank also logged "No checkpoint path provided.":

    PP stage  node             local params (per TP rank)  memory after build  peak allocated  peak reserved
    0         ip-26-0-160-192  46.7M (89.10MiB)            92.03MiB            94.06MiB        96.00MiB
    1         ip-26-0-160-192  21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    2         ip-26-0-161-178  10.5M (20.01MiB)            21.02MiB            23.05MiB        24.00MiB
    3         ip-26-0-161-178  21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    4         ip-26-0-163-226  21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    5         ip-26-0-163-226  10.5M (20.01MiB)            21.02MiB            23.05MiB        24.00MiB
    6         ip-26-0-165-24   21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    7         ip-26-0-165-24   21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    8         ip-26-0-168-238  10.5M (20.01MiB)            21.02MiB            23.05MiB        24.00MiB
    9         ip-26-0-168-238  21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    10        ip-26-0-169-86   21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    11        ip-26-0-169-86   10.5M (20.01MiB)            21.02MiB            23.05MiB        24.00MiB
    12        ip-26-0-172-57   21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    13        ip-26-0-172-57   21M (40.02MiB)              42.03MiB            44.06MiB        46.00MiB
    14        ip-26-0-172-73   25.7M (49.09MiB)            50.01MiB            50.03MiB        52.00MiB
    15        ip-26-0-172-73   0 (0.00MiB)                 0.01MiB             0.03MiB         2.00MiB

[default0]:07/02/2024 21:58:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Total number of parameters: 1.21G (2313.42MiB)
[default0]:07/02/2024 21:58:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Parametrizing model parameters using StandardParametrizator
[default0]:07/02/2024 21:58:11 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/02/2024 21:58:11 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/02/2024 21:58:11 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 0 has 46.7M out of 46.7M (100.00%) params' optimizer states
[default0]:07/02/2024 21:58:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/02/2024 21:58:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Using `datasets` library
[default0]:07/02/2024 21:58:12 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:07/02/2024 21:58:12 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:Repo card metadata block was not found. Setting CardData to empty.
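The 1.21G total and the per-stage counts above are consistent with the LlamaConfig: stage 0 (46.7M per TP rank) appears to hold the token embedding plus the first two decoder layers, the 21M and 10.5M stages hold two and one decoder layers respectively, stage 14 (25.7M per TP rank) matches the output projection (tie_word_embeddings=True ties it to the input embedding, but with pipeline parallelism each copy lives on its own stage and both show up in the summed local counts), and stage 15 holds no parameters, presumably only the loss block. A rough back-of-the-envelope check, as a hypothetical illustration rather than nanotron code:

# Approximate parameter count for the logged LlamaConfig (hidden=2048, layers=24,
# intermediate=4096, kv_heads == heads, vocab=50260); biases and RMSNorm weights
# are ignored, so the totals are only approximate.
hidden, inter, layers, vocab = 2048, 4096, 24, 50260

attn = 4 * hidden * hidden      # q, k, v and o projections (no GQA since num_key_value_heads == num_attention_heads)
mlp = 3 * hidden * inter        # gate, up and down projections (SwiGLU)
per_layer = attn + mlp          # ~41.9M parameters per decoder layer

embedding = vocab * hidden      # ~102.9M input embedding (pipeline stage 0)
lm_head = vocab * hidden        # ~102.9M output projection (pipeline stage 14)

total = embedding + layers * per_layer + lm_head
print(f"{total / 1e9:.2f}G params, {2 * total / 2**20:.0f}MiB in bf16")
# -> roughly 1.21G parameters and ~2313MiB, matching the totals logged above.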
[default0]:07/02/2024 21:58:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Training Plan] There are 1 training stages
[default0]:07/02/2024 21:58:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Stage Training Stage] start from step 1
[default0]:07/02/2024 21:58:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]:
[default0]:07/02/2024 21:58:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Start training] datetime: 2024-07-02 21:58:13.121800 | mbs: 2 | grad_accum: 512 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
[default0]:07/02/2024 21:58:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/02/2024 21:58:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Memory usage: 448.42MiB. Peak allocated 448.42MiB. Peak reserved: 456.00MiB
[default6]:07/02/2024 21:58:13 [WARNING|DP=0|PP=3|TP=2|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
(The other PP/TP ranks on all eight nodes logged the same "Repo card metadata block was not found. Setting CardData to empty." warning at 21:58:13, and the ranks also echoed the message on stderr.)
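The global batch size reported in the [Start training] line follows directly from the parallel layout: with dp=1 data-parallel replica, micro_batch_size=2 and batch_accumulation_per_replica=512, each optimizer step sees 1024 sequences of 4096 tokens. A quick, illustrative sanity check of those numbers:

# Batch-size arithmetic implied by the [Start training] line above.
dp, micro_batch_size, grad_accum, sequence_length, train_steps = 1, 2, 512, 4096, 20

global_batch_size = dp * micro_batch_size * grad_accum
tokens_per_step = global_batch_size * sequence_length
total_tokens = tokens_per_step * train_steps

assert global_batch_size == 1024   # matches global_batch_size in the log
print(f"{tokens_per_step:,} tokens per step, about {total_tokens / 1e6:.0f}M tokens over {train_steps} steps")
# -> 4,194,304 tokens per step and roughly 84M tokens for the whole 20-step benchmark.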
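The "Repo card metadata block was not found" warnings most likely come from the `datasets` / `huggingface_hub` side while fetching roneneldan/TinyStories, presumably because the dataset card has no YAML metadata block; they do not affect training. A minimal, hypothetical stand-alone reproduction of the data access configured by PretrainDatasetsArgs above (nanotron's actual loader additionally tokenizes the text and packs it into 4096-token sequences):

# Hypothetical reproduction of the dataset access; not nanotron's loader.
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories", split="train")  # hf_dataset_or_datasets / hf_dataset_splits
print(ds.column_names)  # expected to include 'text', matching text_column_name='text'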
[default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
[default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[The same c10d::allreduce_ UserWarning and backward-pass line are emitted by every local rank, [default0] through [default7], on the participating nodes.]
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=178, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600072 milliseconds before timing out.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=178, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600041 milliseconds before timing out.
[default5]:[rank13]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=154, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600057 milliseconds before timing out.
[default4]:[rank12]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=154, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[default7]:[rank15]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=154, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[default6]:[rank14]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=154, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600067 milliseconds before timing out.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=178, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=178, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
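Each watchdog entry above reports a point-to-point SEND of a 6-element tensor that waited out its 600000 ms limit, which suggests the matching receive was never posted on the peer pipeline stage within 10 minutes. If the goal were only to rule out a too-tight deadline rather than a genuine schedule mismatch, the process-group timeout can be raised at initialization. A minimal sketch using the stock torch.distributed API, with placeholder values rather than nanotron's actual init path:

    # Minimal sketch (placeholder values, not nanotron's initialization code):
    # give the NCCL process group a longer timeout than the 600000 ms limit the
    # watchdog enforced above.
    import os
    from datetime import timedelta

    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=60),  # placeholder; anything above the 10 min seen above
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
    # Running with NCCL_DEBUG=INFO in the environment also makes the aborted
    # communicator easier to trace.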
[default5]:[rank5]: Traceback (most recent call last):
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default5]:[rank5]:     trainer.train(dataloader)
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default5]:[rank5]:     outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default5]:[rank5]:     outputs = self.pipeline_engine.train_batch_iter(
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter
[default5]:[rank5]:     self.backward(context=context, state=state, grad_accumulator=grad_accumulator)
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward
[default5]:[rank5]:     grad_accumulator.backward(sum(activations))
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward
[default5]:[rank5]:     result = loss.backward()
[default5]:[rank5]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[default5]:[rank5]:     torch.autograd.backward(
[default5]:[rank5]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[default5]:[rank5]:     _engine_run_backward(
[default5]:[rank5]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[default5]:[rank5]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default5]:[rank5]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[default5]:[rank5]:     return user_fn(self, *args)
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward
[default5]:[rank5]:     pipeline_state.run_communication()
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication
[default5]:[rank5]:     send_activation()
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__
[default5]:[rank5]:     self.p2p.send_tensors([self.activation], to_rank=self.to_rank)
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors
[default5]:[rank5]:     futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag)
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors
[default5]:[rank5]:     self._send_meta(tensor, to_rank=to_rank, tag=tag)
[default5]:[rank5]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta
[default5]:[rank5]:     dist.send(
[default5]:[rank5]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default5]:[rank5]:     return func(*args, **kwargs)
[default5]:[rank5]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send
[default5]:[rank5]:     group.send([tensor], group_dst_rank, tag).wait()
[default5]:[rank5]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
[default5]:[rank21]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600075 milliseconds before timing out.
[default7]:[rank23]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
[default6]:[rank22]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[default4]:[rank20]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
[Rank 14 ([default6]) fails with the identical traceback, ending in torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.]
[Ranks 4, 6, 15, 13, 12, 7, 23, 22, 21 and 20 fail with the identical traceback: the backward pass reaches nanotron's pipeline-parallel send_activation, then P2P._send_meta and dist.send, and each raises torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.]
[default6]:[rank30]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
[default7]:[rank31]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
[default5]:[rank29]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
[default4]:[rank28]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
[default6]:[rank38]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
[default7]:[rank39]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600048 milliseconds before timing out.
[default4]:[rank36]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
[default5]:[rank37]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[ranks 38, 37, 31, 28, 30, 29, 39 and 36 each emit the same send-path traceback as rank 21 above (backward -> send_activation -> P2P._send_meta -> dist.send), each ending with torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.]
[default7]:[rank63]:[E ProcessGroupNCCL.cpp:563] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[default5]:[rank61]:[E ProcessGroupNCCL.cpp:563] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[default0]:[rank56]:[E ProcessGroupNCCL.cpp:563] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
[default3]:[rank59]:[E ProcessGroupNCCL.cpp:563] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
[default2]:[rank58]:[E ProcessGroupNCCL.cpp:563] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[default7]:[rank55]:[E ProcessGroupNCCL.cpp:563] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[default4]:[rank52]:[E ProcessGroupNCCL.cpp:563] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[default6]:[rank54]:[E ProcessGroupNCCL.cpp:563] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
[default5]:[rank53]:[E ProcessGroupNCCL.cpp:563] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[default6]:[rank62]:[E ProcessGroupNCCL.cpp:563] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[default4]:[rank60]:[E ProcessGroupNCCL.cpp:563] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[default1]:[rank57]:[E ProcessGroupNCCL.cpp:563] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
[default6]:[rank54]: Traceback (most recent call last):
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default6]:[rank54]:     trainer.train(dataloader)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default6]:[rank54]:     outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default6]:[rank54]:     outputs = self.pipeline_engine.train_batch_iter(
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default6]:[rank54]:     output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default6]:[rank54]:     output = model(**micro_batch)
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank54]:     return self._call_impl(*args, **kwargs)
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank54]:     return forward_call(*args, **kwargs)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default6]:[rank54]:     sharded_logits = self.model(
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank54]:     return self._call_impl(*args, **kwargs)
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank54]:     return forward_call(*args, **kwargs)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default6]:[rank54]:     return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default6]:[rank54]:     hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank54]:     return self._call_impl(*args, **kwargs)
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank54]:     return forward_call(*args, **kwargs)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default6]:[rank54]:     new_kwargs[name] = recv_from_pipeline_state_buffer(
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default6]:[rank54]:     pipeline_state.run_communication()
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default6]:[rank54]:     recv_activation_tensor = recv_activation()
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default6]:[rank54]:     return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default6]:[rank54]:     buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default6]:[rank54]:     meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default6]:[rank54]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default6]:[rank54]:     dist.recv(
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default6]:[rank54]:     return func(*args, **kwargs)
[default6]:[rank54]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default6]:[rank54]:     pg.recv([tensor], group_src_rank, tag).wait()
[default6]:[rank54]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
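The 600000 ms window in every watchdog message comes from the process-group timeout this run was started with. It can be widened when the process group is created, but for a hang like this one, where whole pipeline stages wait on peers that never post the matching send or recv, a longer window only delays the abort. The fragment below is only a sketch of where that knob lives (it assumes the usual torchrun environment variables are set); it is not a fix for the underlying stall.

# Sketch: where the NCCL watchdog window is configured. Run under torchrun so
# RANK / WORLD_SIZE / MASTER_ADDR are set; raising the timeout only postpones
# the abort if a stage is genuinely stuck.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # this log ran with 10 minutes (600000 ms)
)

Setting NCCL_DEBUG=INFO in the environment also makes NCCL log communicator setup and teardown, which can help confirm which stage aborted first.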
[ranks 55, 52 and 53 emit the same receive-path traceback as rank 54 above; rank 62 fails one frame later, at llama.py line 786 in forward_with_hidden_states ("fp32_sharded_logits = self.cast_to_fp32(x=sharded_logits)[\"output\"]"), while receiving that block's input. All four end with torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.]
[default5]:[rank61]: Traceback (most recent call last): [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank61]: trainer.train(dataloader) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank61]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank61]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank61]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank61]: output = model(**micro_batch) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank61]: sharded_logits = self.model( [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank61]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 786, in forward_with_hidden_states [default5]:[rank61]: fp32_sharded_logits = self.cast_to_fp32(x=sharded_logits)["output"] [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank61]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default5]:[rank61]: pipeline_state.run_communication() [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank61]: recv_activation_tensor = recv_activation() [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank61]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank61]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank61]: dist.recv( [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank61]: return func(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank61]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank61]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default7]:[rank63]: Traceback (most recent call last): [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank63]: trainer.train(dataloader) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank63]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank63]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank63]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank63]: output = model(**micro_batch) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: sharded_logits = self.model( [default7]:[rank63]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank63]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 786, in forward_with_hidden_states [default7]:[rank63]: fp32_sharded_logits = self.cast_to_fp32(x=sharded_logits)["output"] [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank63]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank63]: pipeline_state.run_communication() [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank63]: recv_activation_tensor = recv_activation() [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank63]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank63]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank63]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank63]: dist.recv( [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank63]: return func(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank63]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank63]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default4]:[rank60]: Traceback (most recent call last): [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank60]: trainer.train(dataloader) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank60]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank60]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank60]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank60]: output = model(**micro_batch) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank60]: sharded_logits = self.model( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank60]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 786, in forward_with_hidden_states [default4]:[rank60]: fp32_sharded_logits = self.cast_to_fp32(x=sharded_logits)["output"] [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank60]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default4]:[rank60]: pipeline_state.run_communication() [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank60]: recv_activation_tensor = recv_activation() [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank60]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank60]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank60]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank60]: dist.recv( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank60]: return func(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank60]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank60]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default0]:[rank56]: Traceback (most recent call last): [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank56]: trainer.train(dataloader) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank56]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank56]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank56]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank56]: output = model(**micro_batch) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank56]: sharded_logits = self.model( [default0]:[rank56]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank56]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default0]:[rank56]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank56]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank56]: pipeline_state.run_communication() [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank56]: recv_activation_tensor = recv_activation() [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank56]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank56]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank56]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank56]: dist.recv( [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank56]: return func(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank56]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank56]: torch.distributed.DistBackendError: NCCL communicator was aborted on 
rank 1. [default2]:[rank58]: Traceback (most recent call last): [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank58]: trainer.train(dataloader) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank58]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank58]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank58]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank58]: output = model(**micro_batch) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank58]: sharded_logits = self.model( [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank58]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default2]:[rank58]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank58]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in 
recv_from_pipeline_state_buffer
[default2]:[rank58]: pipeline_state.run_communication()
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default2]:[rank58]: recv_activation_tensor = recv_activation()
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default2]:[rank58]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default2]:[rank58]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default2]:[rank58]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default2]:[rank58]: dist.recv(
[default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default2]:[rank58]: return func(*args, **kwargs)
[default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default2]:[rank58]: pg.recv([tensor], group_src_rank, tag).wait()
[default2]:[rank58]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default0]:[rank48]:[E ProcessGroupNCCL.cpp:563] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[default5]:[rank45]:[E ProcessGroupNCCL.cpp:563] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=62, OpType=SEND, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[default6]:[rank46]:[E ProcessGroupNCCL.cpp:563] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=62, OpType=SEND, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600088 milliseconds before timing out.
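The three watchdog entries above show NCCL SEND operations hitting the stock 10-minute limit (Timeout(ms)=600000). The snippet below is not part of this log or of nanotron; it is a minimal sketch, assuming one simply wants a longer limit, of how the process-group timeout can be raised when the group is created, using only the documented timeout argument of torch.distributed.init_process_group.

# Illustrative only, not from the training run: raise the NCCL watchdog timeout so
# long-running pipeline sends/recvs are not aborted after the default 600000 ms.
from datetime import timedelta

import torch.distributed as dist

def init_process_group_with_longer_timeout(minutes: int = 30) -> None:
    # NCCL work objects created after this call inherit the larger timeout.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=minutes))

In this run the peer ranks appear to have failed outright, so a longer timeout would most likely only delay the abort rather than avoid it.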
[default1]:[rank57]: Traceback (most recent call last): [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: trainer.train(dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank57]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank57]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank57]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank57]: output = model(**micro_batch) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank57]: sharded_logits = self.model( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank57]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default1]:[rank57]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank57]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in 
recv_from_pipeline_state_buffer
[default1]:[rank57]: pipeline_state.run_communication()
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default1]:[rank57]: recv_activation_tensor = recv_activation()
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default1]:[rank57]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default1]:[rank57]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default1]:[rank57]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default1]:[rank57]: dist.recv(
[default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default1]:[rank57]: return func(*args, **kwargs)
[default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default1]:[rank57]: pg.recv([tensor], group_src_rank, tag).wait()
[default1]:[rank57]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default7]:[rank47]:[E ProcessGroupNCCL.cpp:563] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=62, OpType=SEND, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
[default3]:[rank51]:[E ProcessGroupNCCL.cpp:563] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600051 milliseconds before timing out.
[default1]:[rank49]:[E ProcessGroupNCCL.cpp:563] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600044 milliseconds before timing out.
[default2]:[rank50]:[E ProcessGroupNCCL.cpp:563] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600047 milliseconds before timing out.
[default4]:[rank44]:[E ProcessGroupNCCL.cpp:563] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=62, OpType=SEND, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
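Every Python traceback above stops in the same place: nanotron's p2p layer blocking in dist.recv on a small metadata tensor (_recv_meta in p2p.py, line 269) while the matching send never arrives, and rank 62's watchdog report a little further down shows the corresponding RECV of a 7-element tensor (NumelIn=7) timing out. The sketch below is not nanotron's implementation; it is a simplified, hypothetical illustration of such a blocking metadata handshake, built only from torch.distributed point-to-point calls, to show where this kind of hang sits.

# Hypothetical illustration, not nanotron code: a fixed-size metadata tensor is
# exchanged ahead of the payload, and the receiver blocks until the sender shows up.
# If the sender has already died, the NCCL watchdog eventually aborts the communicator,
# which surfaces as the DistBackendError seen in the tracebacks above.
import torch
import torch.distributed as dist

META_NUMEL = 7  # same size as the NumelIn=7 RECV reported by rank 62's watchdog

def send_meta(shape: tuple, dst: int) -> None:
    assert len(shape) < META_NUMEL
    meta = torch.zeros(META_NUMEL, dtype=torch.long, device="cuda")
    meta[0] = len(shape)                          # number of dimensions
    meta[1:1 + len(shape)] = torch.tensor(shape)  # the dimensions themselves
    dist.send(meta, dst=dst)

def recv_meta(src: int) -> torch.Tensor:
    # Blocking receive of the peer's metadata; the payload receive would follow it.
    meta = torch.empty(META_NUMEL, dtype=torch.long, device="cuda")
    dist.recv(meta, src=src)
    return meta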
[default3]:[rank59]: Traceback (most recent call last): [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank59]: trainer.train(dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank59]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank59]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank59]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: output = model(**micro_batch) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank59]: sharded_logits = self.model( [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default3]:[rank59]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in 
recv_from_pipeline_state_buffer
[default3]:[rank59]: pipeline_state.run_communication()
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default3]:[rank59]: recv_activation_tensor = recv_activation()
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default3]:[rank59]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default3]:[rank59]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default3]:[rank59]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default3]:[rank59]: dist.recv(
[default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default3]:[rank59]: return func(*args, **kwargs)
[default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default3]:[rank59]: pg.recv([tensor], group_src_rank, tag).wait()
[default3]:[rank59]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default6]:[rank62]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 15] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[default6]:[rank62]:[E ProcessGroupNCCL.cpp:577] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default6]:[rank62]:[E ProcessGroupNCCL.cpp:583] [Rank 15] To avoid data inconsistency, we are taking the entire process down.
[default6]:[rank62]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d0431b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7d055f4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7d055f9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d055fadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f7d51093e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f7d560da609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f7d55ea5353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d0431b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f7d055f4c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7d055f9a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d055fadcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f7d51093e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f7d560da609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f7d55ea5353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d0431b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f7d0527e119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f7d51093e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 
(0x7f7d560da609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f7d55ea5353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default4]:[rank60]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 15] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:577] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:583] [Rank 15] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank60]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f742c4d8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f742d7b1c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f742d7b6a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f742d7b7dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f7479250e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f747e297609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f747e062353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f742c4d8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f742d7b1c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f742d7b6a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f742d7b7dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f7479250e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f747e297609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f747e062353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f742c4d8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f742d43b119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f7479250e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 (0x7f747e297609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f747e062353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default3]:[rank51]: Traceback (most recent call last): [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank51]: trainer.train(dataloader) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank51]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank51]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank51]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank51]: output = model(**micro_batch) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank51]: sharded_logits = self.model( [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank51]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank51]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank51]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank51]: pipeline_state.run_communication() [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank51]: recv_activation_tensor = recv_activation() [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank51]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank51]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank51]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank51]: dist.recv( [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper 
[default3]:[rank51]: return func(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank51]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank51]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank46]: Traceback (most recent call last): [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank46]: trainer.train(dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank46]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank46]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default6]:[rank46]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default6]:[rank46]: grad_accumulator.backward(sum(activations)) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default6]:[rank46]: result = loss.backward() [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default6]:[rank46]: torch.autograd.backward( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default6]:[rank46]: _engine_run_backward( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default6]:[rank46]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default6]:[rank46]: return user_fn(self, *args) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default6]:[rank46]: pipeline_state.run_communication() [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default6]:[rank46]: self.grads_buffer.append(recv_grad()) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default6]:[rank46]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank46]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) 
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank46]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank46]: dist.recv( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank46]: return func(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank46]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank46]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default7]:[rank47]: Traceback (most recent call last): [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank47]: trainer.train(dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank47]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank47]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default7]:[rank47]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default7]:[rank47]: grad_accumulator.backward(sum(activations)) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default7]:[rank47]: result = loss.backward() [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default7]:[rank47]: torch.autograd.backward( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default7]:[rank47]: _engine_run_backward( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default7]:[rank47]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default7]:[rank47]: return user_fn(self, *args) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default7]:[rank47]: pipeline_state.run_communication() [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication 
[default7]:[rank47]: self.grads_buffer.append(recv_grad()) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default7]:[rank47]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank47]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank47]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank47]: dist.recv( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank47]: return func(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank47]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank47]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank49]: Traceback (most recent call last): [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank49]: trainer.train(dataloader) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank49]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank49]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank49]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank49]: output = model(**micro_batch) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank49]: sharded_logits = self.model( [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank49]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank49]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank49]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank49]: pipeline_state.run_communication() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank49]: recv_activation_tensor = recv_activation() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank49]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank49]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank49]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]:[rank49]: dist.recv( [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank49]: return func(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank49]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default0]:[rank48]: Traceback (most recent call last): [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank48]: trainer.train(dataloader) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank48]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank48]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank48]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank48]: output = model(**micro_batch) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank48]: sharded_logits = self.model( [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank48]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank48]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank48]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default0]:[rank48]: pipeline_state.run_communication() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank48]: recv_activation_tensor = recv_activation() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank48]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank48]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank48]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank48]: dist.recv( [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank48]: return func(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank48]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank48]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default5]:[rank45]: Traceback (most recent call last): [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank45]: trainer.train(dataloader) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank45]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default5]:[rank45]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default5]:[rank45]: grad_accumulator.backward(sum(activations)) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default5]:[rank45]: result = loss.backward() [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default5]:[rank45]: torch.autograd.backward( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default5]:[rank45]: _engine_run_backward( [default5]:[rank45]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default5]:[rank45]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default5]:[rank45]: return user_fn(self, *args) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default5]:[rank45]: pipeline_state.run_communication() [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default5]:[rank45]: self.grads_buffer.append(recv_grad()) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default5]:[rank45]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank45]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank45]: dist.recv( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank45]: return func(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank45]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank45]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 13] Timeout at NCCL work: 48, last enqueued NCCL work: 48, last completed NCCL work: 47. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:577] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:583] [Rank 13] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank55]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600000 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f53e5c98897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f53e6f71c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f53e6f76a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f53e6f77dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f5432a10e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f5437a57609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f5437822353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600000 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f53e5c98897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f53e6f71c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f53e6f76a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f53e6f77dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f5432a10e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f5437a57609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f5437822353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f53e5c98897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f53e6bfb119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f5432a10e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 
(0x7f5437a57609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f5437822353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default4]:[rank44]: Traceback (most recent call last): [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank44]: trainer.train(dataloader) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank44]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank44]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default4]:[rank44]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default4]:[rank44]: grad_accumulator.backward(sum(activations)) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default4]:[rank44]: result = loss.backward() [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default4]:[rank44]: torch.autograd.backward( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default4]:[rank44]: _engine_run_backward( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default4]:[rank44]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default4]:[rank44]: return user_fn(self, *args) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default4]:[rank44]: pipeline_state.run_communication() [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 172, in run_communication [default4]:[rank44]: self.grads_buffer.append(recv_grad()) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 50, in __call__ [default4]:[rank44]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank44]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: meta = self._recv_meta(from_rank=from_rank, tag=tag) 
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank44]: dist.recv( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank44]: return func(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank44]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank44]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank50]: Traceback (most recent call last): [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank50]: trainer.train(dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank50]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank50]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank50]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank50]: output = model(**micro_batch) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank50]: sharded_logits = self.model( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank50]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank50]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 
1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank50]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank50]: pipeline_state.run_communication() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank50]: recv_activation_tensor = recv_activation() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank50]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank50]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank50]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default2]:[rank50]: dist.recv( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank50]: return func(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank50]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank50]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 15] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:577] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:583] [Rank 15] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank61]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600009 milliseconds before timing out. 
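All of these watchdog reports show the same Timeout(ms)=600000, i.e. the recv had no matching send for a full ten minutes before the process group was torn down. When debugging hangs like this it is often useful either to shorten that window (to fail fast) or to make NCCL errors surface as Python exceptions on every rank; both are controlled at process-group initialization. In this run the group is created inside nanotron, so the knobs would have to be threaded through its config, but the underlying PyTorch controls look roughly like the sketch below (the environment-variable names have shifted across recent PyTorch releases, so treat them as assumptions to verify against the installed version).

import os
from datetime import timedelta

import torch.distributed as dist

# Ask the NCCL process group to turn watchdog-detected failures into exceptions
# on every rank (older releases spell this NCCL_ASYNC_ERROR_HANDLING).
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

# Optional: make every rank block on each NCCL op so the first failing call is
# reported where it happened, at the cost of communication/compute overlap
# (older spelling: NCCL_BLOCKING_WAIT).
# os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=5),  # replaces the 10-minute limit reported in the log
)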
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc5ddff2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc5df2cbc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc5df2d0a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc5df2d1dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fc62ad6ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fc62fdb1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fc62fb7c353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600009 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc5ddff2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc5df2cbc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc5df2d0a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc5df2d1dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fc62ad6ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fc62fdb1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fc62fb7c353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc5ddff2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7fc5def55119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7fc62ad6ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 
(0x7fc62fdb1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7fc62fb7c353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 14] Timeout at NCCL work: 42, last enqueued NCCL work: 42, last completed NCCL work: 41. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:577] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:583] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0d2d1f1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0d2e4cac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0d2e4cfa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0d2e4d0dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f0d79f69e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f0d7efb0609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f0d7ed7b353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0d2d1f1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f0d2e4cac62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0d2e4cfa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0d2e4d0dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f0d79f69e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f0d7efb0609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f0d7ed7b353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0d2d1f1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f0d2e154119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f0d79f69e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f0d7efb0609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f0d7ed7b353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default2]:[rank58]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 14] Timeout at NCCL work: 42, last enqueued NCCL work: 42, last completed NCCL work: 41. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:577] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:583] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank58]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3b9d7ec897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3b9eac5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3b9eacaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b9eacbdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f3bea564e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f3bef5ab609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f3bef376353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3b9d7ec897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f3b9eac5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3b9eacaa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b9eacbdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f3bea564e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f3bef5ab609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f3bef376353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3b9d7ec897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f3b9e74f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f3bea564e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 
(0x7f3bef5ab609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f3bef376353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default7]:[rank63]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 15] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:577] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:583] [Rank 15] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank63]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8dde2aa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f8ddf583c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8ddf588a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8ddf589dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f8e2b022e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f8e30069609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f8e2fe34353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8dde2aa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f8ddf583c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8ddf588a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8ddf589dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7f8e2b022e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7f8e30069609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7f8e2fe34353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8dde2aa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7f8ddf20d119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7f8e2b022e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7f8e30069609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7f8e2fe34353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default1]:[rank57]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 14] Timeout at NCCL work: 42, last enqueued NCCL work: 42, last completed NCCL work: 41. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:577] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:583] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank57]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f786e4e5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f786f7bec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f786f7c3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f786f7c4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f78bb25de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f78c02a4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f78c006f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f786e4e5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f786f7bec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f786f7c3a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f786f7c4dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f78bb25de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f78c02a4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f78c006f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f786e4e5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f786f448119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f78bb25de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 
(0x7f78c02a4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f78c006f353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default3]:[rank59]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 14] Timeout at NCCL work: 42, last enqueued NCCL work: 42, last completed NCCL work: 41. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:577] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:583] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank59]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8da8f52897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f8daa22bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8daa230a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8daa231dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f8df5ccae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f8dfad11609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f8dfaadc353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8da8f52897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f8daa22bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8daa230a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8daa231dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f8df5ccae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f8dfad11609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f8dfaadc353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8da8f52897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f8da9eb5119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f8df5ccae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f8dfad11609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f8dfaadc353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:[rank51]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 12] Timeout at NCCL work: 59, last enqueued NCCL work: 60, last completed NCCL work: 58. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:577] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:583] [Rank 12] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank51]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f65460ff897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f65473d8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f65473dda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f65473dedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f6592e77e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f6597ebe609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f6597c89353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f65460ff897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f65473d8c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f65473dda80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f65473dedcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f6592e77e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f6597ebe609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f6597c89353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f65460ff897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f6547062119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f6592e77e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 
0x8609 (0x7f6597ebe609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f6597c89353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default5]:[rank53]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 13] Timeout at NCCL work: 48, last enqueued NCCL work: 48, last completed NCCL work: 47. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:577] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:583] [Rank 13] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank53]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef61a7d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fef62d56c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fef62d5ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fef62d5cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fefae7f5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fefb383c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fefb3607353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef61a7d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fef62d56c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fef62d5ba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fef62d5cdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7fefae7f5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7fefb383c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7fefb3607353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef61a7d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7fef629e0119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7fefae7f5e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7fefb383c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7fefb3607353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default4]:[rank52]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 13] Timeout at NCCL work: 48, last enqueued NCCL work: 48, last completed NCCL work: 47. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:577] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:583] [Rank 13] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank52]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd8b157a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd8b2853c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd8b2858a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd8b2859dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fd8fe2f2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fd903339609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fd903104353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd8b157a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fd8b2853c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd8b2858a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd8b2859dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7fd8fe2f2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7fd903339609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7fd903104353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd8b157a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7fd8b24dd119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7fd8fe2f2e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 
(0x7fd903339609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7fd903104353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default1]:[rank49]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 12] Timeout at NCCL work: 59, last enqueued NCCL work: 60, last completed NCCL work: 58. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:577] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:583] [Rank 12] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank49]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f33571a2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f335847bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3358480a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3358481dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f33a3f1ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f33a8f61609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f33a8d2c353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f33571a2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f335847bc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3358480a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3358481dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f33a3f1ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f33a8f61609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f33a8d2c353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f33571a2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f3358105119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f33a3f1ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f33a8f61609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f33a8d2c353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default0]:[rank48]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 12] Timeout at NCCL work: 59, last enqueued NCCL work: 60, last completed NCCL work: 58. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:577] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:583] [Rank 12] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank48]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74e92da897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f74ea5b3c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f74ea5b8a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f74ea5b9dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f7536052e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f753b099609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f753ae64353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74e92da897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f74ea5b3c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f74ea5b8a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f74ea5b9dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f7536052e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f753b099609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f753ae64353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74e92da897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f74ea23d119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f7536052e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 
0x8609 (0x7f753b099609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f753ae64353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default6]:[rank54]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 13] Timeout at NCCL work: 48, last enqueued NCCL work: 48, last completed NCCL work: 47. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:577] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:583] [Rank 13] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank54]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600004 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb17c75c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb17da35c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb17da3aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb17da3bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7fb1c94d4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fb1ce51b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fb1ce2e6353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600004 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb17c75c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb17da35c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb17da3aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb17da3bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7fb1c94d4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7fb1ce51b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7fb1ce2e6353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb17c75c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7fb17d6bf119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7fb1c94d4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 (0x7fb1ce51b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7fb1ce2e6353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default2]:[rank50]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 12] Timeout at NCCL work: 59, last enqueued NCCL work: 60, last completed NCCL work: 58. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:577] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:583] [Rank 12] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank50]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb9661dc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb9674b5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb9674baa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb9674bbdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fb9b2f54e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fb9b7f9b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fb9b7d66353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59, OpType=SEND, NumelIn=4194304, NumelOut=4194304, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb9661dc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fb9674b5c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb9674baa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb9674bbdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fb9b2f54e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fb9b7f9b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fb9b7d66353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb9661dc897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7fb96713f119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7fb9b2f54e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 
0x8609 (0x7fb9b7f9b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7fb9b7d66353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: W0702 22:09:35.984000 139716060084032 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 846892 closing signal SIGTERM W0702 22:09:35.984000 139716060084032 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 846893 closing signal SIGTERM W0702 22:09:35.984000 139716060084032 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 846895 closing signal SIGTERM E0702 22:09:36.666000 139716060084032 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 2 (pid: 846894) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_22:09:35 host : ip-26-0-172-73.ec2.internal rank : 60 (local_rank: 4) exitcode : -6 (pid: 846896) error_file: traceback : Signal 6 (SIGABRT) received by PID 846896 [2]: time : 2024-07-02_22:09:35 host : ip-26-0-172-73.ec2.internal rank : 61 (local_rank: 5) exitcode : -6 (pid: 846897) error_file: traceback : Signal 6 (SIGABRT) received by PID 846897 [3]: time : 2024-07-02_22:09:35 host : ip-26-0-172-73.ec2.internal rank : 62 (local_rank: 6) exitcode : -6 (pid: 846898) error_file: traceback : Signal 6 (SIGABRT) received by PID 846898 [4]: time : 2024-07-02_22:09:35 host : ip-26-0-172-73.ec2.internal rank : 63 (local_rank: 7) exitcode : -6 (pid: 846899) error_file: traceback : Signal 6 (SIGABRT) received by PID 846899 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_22:09:35 host : ip-26-0-172-73.ec2.internal rank : 58 (local_rank: 2) exitcode : -6 (pid: 846894) error_file: traceback : Signal 6 (SIGABRT) received by PID 846894 ============================================================ srun: error: ip-26-0-172-73: task 7: Exited with exit code 1 E0702 22:09:41.196000 140206314170176 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 1004628) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 49 (local_rank: 1) exitcode : -6 (pid: 1004629) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004629 [2]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 50 (local_rank: 2) exitcode : -6 (pid: 1004630) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004630 [3]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 51 (local_rank: 3) exitcode : -6 (pid: 1004631) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004631 [4]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 52 (local_rank: 4) exitcode : -6 (pid: 1004632) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004632 [5]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 53 (local_rank: 5) exitcode : -6 (pid: 1004633) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004633 [6]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 54 (local_rank: 6) exitcode : -6 (pid: 1004634) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004634 [7]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 55 (local_rank: 7) exitcode : -6 (pid: 1004635) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004635 ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_22:09:41 host : ip-26-0-172-57.ec2.internal rank : 48 (local_rank: 0) exitcode : -6 (pid: 1004628) error_file: traceback : Signal 6 (SIGABRT) received by PID 1004628 ============================================================ srun: error: ip-26-0-172-57: task 6: Exited with exit code 1 [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 29] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 30] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 29] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 30] ProcessGroupNCCL preparing to dump debug info. 
[default4]:[rank28]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 28] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default4]:[rank28]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 28] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 13] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 13] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 12] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 12] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 4] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 4] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank22]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 22] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default6]:[rank22]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 22] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 21] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 21] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 7] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 7] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 14] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 14] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 20] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 20] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 23] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 23] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 15] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 15] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 5] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 45] Heartbeat monitor timed out! Process will be terminated after dumping debug info. 
workMetaList_.size()=0 [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 45] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 6] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 6] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 31] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 31] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 44] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 44] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 47] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 47] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 46] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 46] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank39]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 39] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default7]:[rank39]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 39] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank37]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 37] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default5]:[rank37]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 37] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank36]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 36] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default4]:[rank36]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 36] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank38]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 38] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0 [default6]:[rank38]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 38] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 3] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank12]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 3] [PG 4 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. 
This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 3] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank15]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 3] [PG 4 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 3] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank13]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 3] [PG 4 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 11] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default6]:[rank46]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 11] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank46]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 11] [PG 4 Rank 11] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. 
This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default6]:[rank22]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default6]:[rank22]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 5] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank22]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 5] [PG 4 Rank 5] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default4]:[rank4]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 1] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 11] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default5]:[rank45]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 11] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank45]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 11] [PG 4 Rank 11] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default5]:[rank5]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 1] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank5]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 1] [PG 4 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives.
This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default4]:[rank4]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 1] [PG 4 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default7]:[rank39]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 9] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default7]:[rank39]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 9] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank39]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 9] [PG 4 Rank 9] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default6]:[rank6]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 1] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank6]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 1] [PG 4 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
workMetaList_.size() = 2 [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default5]:[rank21]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 5] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank21]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 5] [PG 4 Rank 5] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default6]:[rank38]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 9] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default6]:[rank38]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 9] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank38]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 9] [PG 4 Rank 9] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 7] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default7]:[rank31]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 7] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank31]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 7] [PG 4 Rank 7] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default5]:[rank37]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 9] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default5]:[rank37]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 9] ProcessGroupNCCL preparing to dump debug info. 
[default5]:[rank37]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 9] [PG 4 Rank 9] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 7] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default6]:[rank30]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 7] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank30]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 7] [PG 4 Rank 7] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 7] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default5]:[rank29]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 7] ProcessGroupNCCL preparing to dump debug info. [default5]:[rank29]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 7] [PG 4 Rank 7] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default7]:[rank7]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 1] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank7]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 1] [PG 4 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. 
This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 11] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default4]:[rank44]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 11] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank44]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 11] [PG 4 Rank 11] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default4]:[rank36]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 9] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default4]:[rank36]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 9] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank36]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 9] [PG 4 Rank 9] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 3] ProcessGroupNCCL preparing to dump debug info. [default6]:[rank14]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 3] [PG 4 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. 
This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default4]:[rank20]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 5] ProcessGroupNCCL preparing to dump debug info. [default4]:[rank20]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 5] [PG 4 Rank 5] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default7]:[rank23]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 5] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank23]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 5] [PG 4 Rank 5] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2 [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 11] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2 [default7]:[rank47]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 11] ProcessGroupNCCL preparing to dump debug info. [default7]:[rank47]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 11] [PG 4 Rank 11] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. 
This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2
[default4]:[rank28]:[E ProcessGroupNCCL.cpp:1316] [PG 4 Rank 7] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2
[default4]:[rank28]:[E ProcessGroupNCCL.cpp:1153] [PG 4 Rank 7] ProcessGroupNCCL preparing to dump debug info.
[default4]:[rank28]:[F ProcessGroupNCCL.cpp:1169] [PG 4 Rank 7] [PG 4 Rank 7] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2
W0702 22:28:02.083000 139982293182272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1777698 closing signal SIGTERM
W0702 22:28:02.086000 139982293182272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1777699 closing signal SIGTERM
W0702 22:28:02.086000 139982293182272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1777700 closing signal SIGTERM
W0702 22:28:02.086000 139982293182272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1777701 closing signal SIGTERM
W0702 22:28:02.125000 140035368605504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 842653 closing signal SIGTERM
W0702 22:28:02.125000 140035368605504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 842654 closing signal SIGTERM
W0702 22:28:02.125000 140035368605504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 842655 closing signal SIGTERM
W0702 22:28:02.125000 140035368605504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 842656 closing signal SIGTERM
W0702 22:28:02.125000 139641247233856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1080510 closing signal SIGTERM
W0702 22:28:02.127000 139641247233856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1080511 closing signal SIGTERM
W0702 22:28:02.127000 139641247233856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1080512 closing signal SIGTERM
W0702 22:28:02.127000 139641247233856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1080513 closing signal SIGTERM
W0702 22:28:02.161000 139702449211200 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 469371 closing signal SIGTERM
W0702 22:28:02.161000 139702449211200 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 469372 closing signal SIGTERM
W0702 22:28:02.161000 139702449211200 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 469373 closing signal SIGTERM
W0702 22:28:02.161000 139702449211200 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 469374 closing signal SIGTERM
W0702 22:28:02.185000 140494472976192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805147 closing signal SIGTERM
W0702 22:28:02.186000 140494472976192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805148 closing signal SIGTERM
W0702 22:28:02.186000 140494472976192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805149 closing signal SIGTERM
W0702 22:28:02.186000 140494472976192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1805150 closing signal SIGTERM
W0702 22:28:02.279000 140270114883392 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3165815 closing signal SIGTERM
W0702 22:28:02.281000 140270114883392 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3165816 closing signal SIGTERM
W0702 22:28:02.281000 140270114883392 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3165817 closing signal SIGTERM
W0702 22:28:02.281000 140270114883392 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3165818 closing signal SIGTERM
E0702 22:28:06.885000 139641247233856 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 1080514) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-160-192.ec2.internal
  rank : 5 (local_rank: 5)
  exitcode : -6 (pid: 1080515)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1080515
[2]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-160-192.ec2.internal
  rank : 6 (local_rank: 6)
  exitcode : -6 (pid: 1080516)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1080516
[3]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-160-192.ec2.internal
  rank : 7 (local_rank: 7)
  exitcode : -6 (pid: 1080517)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1080517
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-160-192.ec2.internal
  rank : 4 (local_rank: 4)
  exitcode : -6 (pid: 1080514)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1080514
============================================================
srun: error: ip-26-0-160-192: task 0: Exited with exit code 1
E0702 22:28:08.380000 140270114883392 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 3165819) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
E0702 22:28:08.429000 139702449211200 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 469375) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
E0702 22:28:08.460000 140035368605504 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 842657) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
W0702 22:28:08.507000 140270114883392 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-226.ec2.internal_3165746_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.533000 139702449211200 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-178.ec2.internal_469301_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.543000 140270114883392 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-226.ec2.internal_3165746_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.559000 140270114883392 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-226.ec2.internal_3165746_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
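The watchdog message repeated throughout the block above names two environment variables, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC and TORCH_NCCL_ENABLE_MONITORING. Below is a minimal sketch of how they could be applied on a retry of this job; the 1800-second value and the explicit init_process_group call are illustrative assumptions, not taken from this run (nanotron sets up its own process groups).

```python
# Sketch only: raise (or disable) the NCCL heartbeat monitor named in the
# watchdog error above. Both variables must be set before the NCCL process
# group is created, e.g. at the top of the training entry point launched by
# torchrun. The 1800 s value is an assumed example, not a recommendation.
import os
from datetime import timedelta

os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"  # the timeout that fired here was 600 s
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"       # alternative: turn the monitor off

import torch.distributed as dist

# Illustrative init call; it relies on the RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
# variables that torchrun already exports, and passes a longer collective timeout
# so ordinary collectives get the same headroom as the heartbeat monitor.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
```

Note that raising the timeout only hides a hang for longer; the repeated stalls across ranks above still have to be debugged.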
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-163-226.ec2.internal
  rank : 21 (local_rank: 5)
  exitcode : -6 (pid: 3165820)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 3165820
[2]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-163-226.ec2.internal
  rank : 22 (local_rank: 6)
  exitcode : -6 (pid: 3165821)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 3165821
[3]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-163-226.ec2.internal
  rank : 23 (local_rank: 7)
  exitcode : -6 (pid: 3165822)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 3165822
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-163-226.ec2.internal
  rank : 20 (local_rank: 4)
  exitcode : -6 (pid: 3165819)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 3165819
============================================================
W0702 22:28:08.565000 139702449211200 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-178.ec2.internal_469301_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.566000 140035368605504 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_842584_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.588000 139702449211200 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-178.ec2.internal_469301_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-161-178.ec2.internal
  rank : 13 (local_rank: 5)
  exitcode : -6 (pid: 469376)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 469376
[2]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-161-178.ec2.internal
  rank : 14 (local_rank: 6)
  exitcode : -6 (pid: 469377)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 469377
[3]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-161-178.ec2.internal
  rank : 15 (local_rank: 7)
  exitcode : -6 (pid: 469378)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 469378
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-161-178.ec2.internal
  rank : 12 (local_rank: 4)
  exitcode : -6 (pid: 469375)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 469375
============================================================
W0702 22:28:08.600000 140035368605504 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_842584_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.618000 140035368605504 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_842584_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-165-24.ec2.internal
  rank : 29 (local_rank: 5)
  exitcode : -6 (pid: 842658)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 842658
[2]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-165-24.ec2.internal
  rank : 30 (local_rank: 6)
  exitcode : -6 (pid: 842659)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 842659
[3]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-165-24.ec2.internal
  rank : 31 (local_rank: 7)
  exitcode : -6 (pid: 842660)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 842660
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-165-24.ec2.internal
  rank : 28 (local_rank: 4)
  exitcode : -6 (pid: 842657)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 842657
============================================================
E0702 22:28:08.714000 140494472976192 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 1805151) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
W0702 22:28:08.813000 140494472976192 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-168-238.ec2.internal_1805078_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.846000 140494472976192 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-168-238.ec2.internal_1805078_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:08.863000 140494472976192 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-168-238.ec2.internal_1805078_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-168-238.ec2.internal
  rank : 37 (local_rank: 5)
  exitcode : -6 (pid: 1805152)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1805152
[2]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-168-238.ec2.internal
  rank : 38 (local_rank: 6)
  exitcode : -6 (pid: 1805153)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1805153
[3]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-168-238.ec2.internal
  rank : 39 (local_rank: 7)
  exitcode : -6 (pid: 1805154)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1805154
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-168-238.ec2.internal
  rank : 36 (local_rank: 4)
  exitcode : -6 (pid: 1805151)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1805151
============================================================
E0702 22:28:08.906000 139982293182272 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 1777702) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
W0702 22:28:09.020000 139982293182272 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-86.ec2.internal_1777629_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:09.052000 139982293182272 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-86.ec2.internal_1777629_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 22:28:09.070000 139982293182272 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-86.ec2.internal_1777629_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-169-86.ec2.internal
  rank : 45 (local_rank: 5)
  exitcode : -6 (pid: 1777703)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1777703
[2]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-169-86.ec2.internal
  rank : 46 (local_rank: 6)
  exitcode : -6 (pid: 1777704)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1777704
[3]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-169-86.ec2.internal
  rank : 47 (local_rank: 7)
  exitcode : -6 (pid: 1777705)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1777705
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-02_22:28:02
  host : ip-26-0-169-86.ec2.internal
  rank : 44 (local_rank: 4)
  exitcode : -6 (pid: 1777702)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 1777702
============================================================
srun: error: ip-26-0-161-178: task 1: Exited with exit code 1
srun: error: ip-26-0-163-226: task 2: Exited with exit code 1
srun: error: ip-26-0-165-24: task 3: Exited with exit code 1
srun: error: ip-26-0-168-238: task 4: Exited with exit code 1
srun: error: ip-26-0-169-86: task 5: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
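The closing hint refers to the optional hf_transfer backend for huggingface_hub uploads. A minimal sketch of enabling it when pushing the benchmark artifacts is shown below; the repository id and file name are placeholders, not taken from this run.

```python
# Sketch only: enable the Rust-based hf_transfer backend for huggingface_hub.
# Requires `pip install huggingface_hub hf_transfer`. The variable must be set
# before huggingface_hub is imported so the backend is picked up.
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi

api = HfApi()
# Placeholder repo id and file; substitute the actual benchmark artifacts.
api.upload_file(
    path_or_fileobj="bench_cluster.log",
    path_in_repo="bench_cluster.log",
    repo_id="your-org/bench-cluster-logs",
)
```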