======================== START TIME: Sat Jul 6 09:18:51 UTC 2024 python3 version = Python 3.10.14 ======================== The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. Token is valid (permission: write). Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token Login successful Already on 'bench_cluster' M examples/config_tiny_llama.py M examples/config_tiny_llama.yaml M examples/train_tiny_llama.sh Your branch is up to date with 'origin/bench_cluster'. Job status: RUNNING [2024-07-06 09:18:59,140] torch.distributed.run: [WARNING] [2024-07-06 09:18:59,140] torch.distributed.run: [WARNING] ***************************************** [2024-07-06 09:18:59,140] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. [2024-07-06 09:18:59,140] torch.distributed.run: [WARNING] ***************************************** [2024-07-06 09:18:59,677] torch.distributed.run: [WARNING] [2024-07-06 09:18:59,677] torch.distributed.run: [WARNING] ***************************************** [2024-07-06 09:18:59,677] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-07-06 09:19:00,289] torch.distributed.run: [WARNING] ***************************************** [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Config: [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: run='%date_%jobid', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: seed=42, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: step=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: consumed_train_samples=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: benchmark_csv_path=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: ignore_sanity_checks=True), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: parallelism=ParallelismArgs(dp=2, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: pp=32, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tp=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: pp_engine=, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tp_mode=, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tp_linear_async_communication=False, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: expert_parallel_size=1), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: eos_token_id=2, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: hidden_act='silu', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: hidden_size=2048, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: initializer_range=0.02, [default0]:07/06/2024 09:19:23 
[INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: intermediate_size=4096, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: is_llama_config=True, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: max_position_embeddings=4096, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: num_attention_heads=32, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: num_hidden_layers=24, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: num_key_value_heads=32, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: pad_token_id=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: pretraining_tp=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: rms_norm_eps=1e-05, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: rope_scaling=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: rope_theta=10000.0, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tie_word_embeddings=True, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: use_cache=True, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: vocab_size=50257), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: init_method=RandomInit(std=0.025), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: dtype=torch.bfloat16, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: make_vocab_size_divisible_by=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: ddp_bucket_cap_mb=25), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tokenizer_revision=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tokenizer_max_length=None), 
[default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: checkpoints=CheckpointsArgs(checkpoints_path=PosixPath('/dev/null'), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: checkpoint_interval=100000, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: save_initial_state=False, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: resume_checkpoint_path=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: checkpoints_path_is_shared_file_system=False), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: logging=LoggingArgs(log_level='info', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: log_level_replica='info', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: iteration_step_info_interval=1), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tokens=TokensArgs(sequence_length=4096, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: train_steps=20, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: micro_batch_size=8, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: batch_accumulation_per_replica=64, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: val_check_interval=-1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: limit_val_batches=0, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: limit_test_batches=0), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: adam_beta1=0.9, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: adam_beta2=0.95, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: torch_adam_is_fused=True, [default0]:07/06/2024 09:19:23 
[INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: name='adamW'), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: zero_stage=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: weight_decay=0.01, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: clip_grad=1.0, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: accumulate_grad_in_fp32=True, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: lr_warmup_steps=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: lr_warmup_style='linear', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: lr_decay_style='linear', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: lr_decay_steps=19, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: lr_decay_starting_step=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: min_decay_lr=1e-05)), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: start_training_step=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: hf_dataset_splits='train', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: hf_dataset_config_name=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: dataset_processing_num_proc_per_process=64, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: dataset_overwrite_cache=False, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: text_column_name='text'), 
[default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: seed=42, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: num_loading_workers=0))], [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: profiler=ProfilerArgs(profiler_export_path=PosixPath('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-2_tp-1_pp-32_mbz-8')), [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: lighteval=None) [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Model Config: [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: LlamaConfig(bos_token_id=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: eos_token_id=2, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: hidden_act='silu', [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: hidden_size=2048, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: initializer_range=0.02, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: intermediate_size=4096, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: is_llama_config=True, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: max_position_embeddings=4096, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: num_attention_heads=32, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: num_hidden_layers=24, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: num_key_value_heads=32, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: pad_token_id=None, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: pretraining_tp=1, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: rms_norm_eps=1e-05, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: rope_scaling=None, [default0]:07/06/2024 09:19:23 
[INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: rope_theta=10000.0, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: tie_word_embeddings=True, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: use_cache=True, [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: vocab_size=50257) [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Building model.. [default0]:07/06/2024 09:19:23 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Setting PP block ranks... [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=19|TP=0|ip-26-0-169-207]: Local number of parameters: 41.9M (80.01MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=19|TP=0|ip-26-0-169-207]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=19|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=30|TP=0|ip-26-0-171-230]: Local number of parameters: 0 (0.00MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=30|TP=0|ip-26-0-171-230]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=30|TP=0|ip-26-0-171-230]: No checkpoint path provided. [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=17|TP=0|ip-26-0-169-207]: Local number of parameters: 41.9M (80.01MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=17|TP=0|ip-26-0-169-207]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=17|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=18|TP=0|ip-26-0-169-207]: Local number of parameters: 41.9M (80.01MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=18|TP=0|ip-26-0-169-207]: [After model building] Memory usage: 81.02MiB. 
Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=18|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=28|TP=0|ip-26-0-171-230]: Local number of parameters: 0 (0.00MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=28|TP=0|ip-26-0-171-230]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=28|TP=0|ip-26-0-171-230]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=16|TP=0|ip-26-0-169-207]: Local number of parameters: 41.9M (80.01MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=16|TP=0|ip-26-0-169-207]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=16|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Total number of parameters: 1.21G (2312.82MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Local number of parameters: 145M (276.32MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 277.33MiB. Peak allocated: 279.36MiB Peak reserved: 294.00MiB [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Parametrizing model parameters using StandardParametrizator [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=31|TP=0|ip-26-0-171-230]: Local number of parameters: 0 (0.00MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=31|TP=0|ip-26-0-171-230]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=31|TP=0|ip-26-0-171-230]: No checkpoint path provided. 
[default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=24|TP=0|ip-26-0-171-168]: Local number of parameters: 2.05K (0.00MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=24|TP=0|ip-26-0-171-168]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=24|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=29|TP=0|ip-26-0-171-230]: Local number of parameters: 0 (0.00MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=29|TP=0|ip-26-0-171-230]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=29|TP=0|ip-26-0-171-230]: No checkpoint path provided. [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=3|TP=0|ip-26-0-168-120]: Local number of parameters: 41.9M (80.01MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=3|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=3|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=1|TP=0|ip-26-0-168-120]: Local number of parameters: 41.9M (80.01MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=1|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=1|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=2|TP=0|ip-26-0-168-120]: Local number of parameters: 41.9M (80.01MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=2|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=2|TP=0|ip-26-0-168-120]: No checkpoint path provided. 
[default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=9|TP=0|ip-26-0-169-132]: Local number of parameters: 41.9M (80.01MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=9|TP=0|ip-26-0-169-132]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=9|TP=0|ip-26-0-169-132]: No checkpoint path provided. [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=27|TP=0|ip-26-0-171-168]: Local number of parameters: 0 (0.00MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=27|TP=0|ip-26-0-171-168]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=27|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=8|TP=0|ip-26-0-169-132]: Local number of parameters: 41.9M (80.01MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=8|TP=0|ip-26-0-169-132]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=8|TP=0|ip-26-0-169-132]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=12|TP=0|ip-26-0-169-139]: Local number of parameters: 41.9M (80.01MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=12|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=14|TP=0|ip-26-0-169-139]: Local number of parameters: 41.9M (80.01MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=14|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=14|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=12|TP=0|ip-26-0-169-139]: No checkpoint path provided. 
[default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=10|TP=0|ip-26-0-169-132]: Local number of parameters: 41.9M (80.01MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=10|TP=0|ip-26-0-169-132]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=10|TP=0|ip-26-0-169-132]: No checkpoint path provided. [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=26|TP=0|ip-26-0-171-168]: Local number of parameters: 0 (0.00MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=26|TP=0|ip-26-0-171-168]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=26|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=15|TP=0|ip-26-0-169-139]: Local number of parameters: 41.9M (80.01MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=15|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=15|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=25|TP=0|ip-26-0-171-168]: Local number of parameters: 103M (196.32MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=25|TP=0|ip-26-0-171-168]: [After model building] Memory usage: 196.33MiB. Peak allocated: 196.35MiB Peak reserved: 200.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=25|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=6|TP=0|ip-26-0-168-238]: Local number of parameters: 41.9M (80.01MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=6|TP=0|ip-26-0-168-238]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=6|TP=0|ip-26-0-168-238]: No checkpoint path provided. 
[default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=4|TP=0|ip-26-0-168-238]: Local number of parameters: 41.9M (80.01MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=7|TP=0|ip-26-0-168-238]: Local number of parameters: 41.9M (80.01MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=4|TP=0|ip-26-0-168-238]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=4|TP=0|ip-26-0-168-238]: No checkpoint path provided. [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=7|TP=0|ip-26-0-168-238]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=7|TP=0|ip-26-0-168-238]: No checkpoint path provided. [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=11|TP=0|ip-26-0-169-132]: Local number of parameters: 41.9M (80.01MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=11|TP=0|ip-26-0-169-132]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=11|TP=0|ip-26-0-169-132]: No checkpoint path provided. [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=13|TP=0|ip-26-0-169-139]: Local number of parameters: 41.9M (80.01MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=13|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=13|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=22|TP=0|ip-26-0-169-86]: Local number of parameters: 41.9M (80.01MiB) [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=22|TP=0|ip-26-0-169-86]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default4]:07/06/2024 09:19:39 [INFO|DP=0|PP=22|TP=0|ip-26-0-169-86]: No checkpoint path provided. 
[default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=23|TP=0|ip-26-0-169-86]: Local number of parameters: 41.9M (80.01MiB) [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=23|TP=0|ip-26-0-169-86]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default6]:07/06/2024 09:19:39 [INFO|DP=0|PP=23|TP=0|ip-26-0-169-86]: No checkpoint path provided. [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=20|TP=0|ip-26-0-169-86]: Local number of parameters: 41.9M (80.01MiB) [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=20|TP=0|ip-26-0-169-86]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default0]:07/06/2024 09:19:39 [INFO|DP=0|PP=20|TP=0|ip-26-0-169-86]: No checkpoint path provided. [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=5|TP=0|ip-26-0-168-238]: Local number of parameters: 41.9M (80.01MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=5|TP=0|ip-26-0-168-238]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=5|TP=0|ip-26-0-168-238]: No checkpoint path provided. [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=21|TP=0|ip-26-0-169-86]: Local number of parameters: 41.9M (80.01MiB) [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=21|TP=0|ip-26-0-169-86]: [After model building] Memory usage: 81.02MiB. Peak allocated: 83.05MiB Peak reserved: 96.00MiB [default2]:07/06/2024 09:19:39 [INFO|DP=0|PP=21|TP=0|ip-26-0-169-86]: No checkpoint path provided. [default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=18|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=19|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=1|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=30|TP=0|ip-26-0-171-230]: No checkpoint path provided. 
[default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=16|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=17|TP=0|ip-26-0-169-207]: No checkpoint path provided. [default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=0|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=31|TP=0|ip-26-0-171-230]: No checkpoint path provided. [default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=29|TP=0|ip-26-0-171-230]: No checkpoint path provided. [default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=28|TP=0|ip-26-0-171-230]: No checkpoint path provided. [default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=2|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=27|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=3|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=24|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=8|TP=0|ip-26-0-169-132]: No checkpoint path provided. [default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=26|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=10|TP=0|ip-26-0-169-132]: No checkpoint path provided. [default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=12|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=13|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=14|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=15|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=25|TP=0|ip-26-0-171-168]: No checkpoint path provided. [default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=9|TP=0|ip-26-0-169-132]: No checkpoint path provided. [default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=4|TP=0|ip-26-0-168-238]: No checkpoint path provided. 
[default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=7|TP=0|ip-26-0-168-238]: No checkpoint path provided.
[default1]:07/06/2024 09:19:40 [INFO|DP=1|PP=20|TP=0|ip-26-0-169-86]: No checkpoint path provided.
[default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=23|TP=0|ip-26-0-169-86]: No checkpoint path provided.
[default7]:07/06/2024 09:19:40 [INFO|DP=1|PP=11|TP=0|ip-26-0-169-132]: No checkpoint path provided.
[default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=5|TP=0|ip-26-0-168-238]: No checkpoint path provided.
[default3]:07/06/2024 09:19:40 [INFO|DP=1|PP=21|TP=0|ip-26-0-169-86]: No checkpoint path provided.
[default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=6|TP=0|ip-26-0-168-238]: No checkpoint path provided.
[default5]:07/06/2024 09:19:40 [INFO|DP=1|PP=22|TP=0|ip-26-0-169-86]: No checkpoint path provided.
[default0]:07/06/2024 09:19:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/06/2024 09:19:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/06/2024 09:19:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [ZeRO sharding] DP Rank 0 has 72.4M out of 145M (50.00%) params' optimizer states
[default0]:07/06/2024 09:19:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [ZeRO sharding] DP Rank 1 has 72.4M out of 145M (50.00%) params' optimizer states
[default2]:Traceback (most recent call last):
[default2]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module>
[default2]:    trainer = DistributedTrainer(config_file)
[default2]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 185, in __init__
[default2]:    self.optimizer, self.grad_accumulator = init_optimizer_and_grad_accumulator(
[default2]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/helpers.py", line 401, in init_optimizer_and_grad_accumulator
[default2]:    param = model.get_parameter(optim_model_param_name)
[default2]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 714, in get_parameter
[default2]:    mod: torch.nn.Module = self.get_submodule(module_path)
[default2]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 681, in get_submodule
[default2]:    raise AttributeError(mod._get_name() + " has no "
[default2]:AttributeError: PipelineBlock has no attribute `pp_block`
[default3]:Traceback (most recent call last):
[default3]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module>
[default3]:    trainer = DistributedTrainer(config_file)
[default3]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 185, in __init__
[default3]:    self.optimizer, self.grad_accumulator = init_optimizer_and_grad_accumulator(
[default3]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/helpers.py", line 401, in init_optimizer_and_grad_accumulator
[default3]:    param = model.get_parameter(optim_model_param_name)
[default3]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 714, in get_parameter
[default3]:    mod: torch.nn.Module = self.get_submodule(module_path)
[default3]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 681, in get_submodule
[default3]:    raise AttributeError(mod._get_name() + " has no "
[default3]:AttributeError: PipelineBlock has no attribute `pp_block`
[default0]:07/06/2024 09:19:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/06/2024 09:19:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Using `datasets` library
[default0]:07/06/2024 09:19:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Loading tokenizer from openai-community/gpt2 and
transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:19:44 [WARNING|DP=0|PP=0|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [2024-07-06 09:19:46,467] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2804641 closing signal SIGTERM [2024-07-06 09:19:46,467] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2804642 closing signal SIGTERM [2024-07-06 09:19:46,468] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2804645 closing signal SIGTERM [2024-07-06 09:19:46,468] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2804646 closing signal SIGTERM [2024-07-06 09:19:46,469] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2804647 closing signal SIGTERM [2024-07-06 09:19:46,469] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2804648 closing signal SIGTERM [2024-07-06 09:19:48,383] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 2804643) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:19:46 host : ip-26-0-171-168.ec2.internal rank : 51 (local_rank: 3) exitcode : 1 (pid: 2804644) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:19:46 host : ip-26-0-171-168.ec2.internal rank : 50 (local_rank: 2) exitcode : 1 (pid: 2804643) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-171-168: task 6: Exited with exit code 1 [default1]:07/06/2024 09:19:48 [WARNING|DP=1|PP=8|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:19:48 [WARNING|DP=1|PP=10|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:19:48 [WARNING|DP=0|PP=8|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. 
[default4]:07/06/2024 09:19:48 [WARNING|DP=0|PP=10|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:19:48 [WARNING|DP=0|PP=14|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:19:48 [WARNING|DP=0|PP=12|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:19:48 [WARNING|DP=1|PP=14|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:19:48 [WARNING|DP=1|PP=15|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:19:48 [WARNING|DP=1|PP=13|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/06/2024 09:19:48 [WARNING|DP=1|PP=12|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:19:48 [WARNING|DP=1|PP=9|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:19:48 [WARNING|DP=0|PP=11|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:19:48 [WARNING|DP=0|PP=15|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:19:48 [WARNING|DP=0|PP=13|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:19:48 [WARNING|DP=1|PP=11|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. 
[default1]:07/06/2024 09:19:48 [WARNING|DP=1|PP=4|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:19:48 [WARNING|DP=0|PP=6|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:19:48 [WARNING|DP=1|PP=7|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:19:48 [WARNING|DP=1|PP=5|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:19:48 [WARNING|DP=0|PP=4|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:19:48 [WARNING|DP=0|PP=7|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. 
[default5]:07/06/2024 09:19:48 [WARNING|DP=1|PP=6|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:19:48 [WARNING|DP=0|PP=5|TP=0|ip-26-0-168-238]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:19:49 [WARNING|DP=0|PP=9|TP=0|ip-26-0-169-132]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1182] [Rank 56] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-171-168.ec2.internal<51368> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa99f378d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7fa9a051ffa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fa9a052027b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7fa9a0523c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa9a0524839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7fa9ea228e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7fa9ef330609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7fa9ef0fb353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 56] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-171-168.ec2.internal<51368> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa99f378d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7fa9a051ffa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fa9a052027b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7fa9a0523c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa9a0524839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7fa9ea228e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7fa9ef330609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7fa9ef0fb353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa99f378d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: 
+ 0xdf6b11 (0x7fa9a027ab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7fa9ea228e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7fa9ef330609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7fa9ef0fb353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:[rank40]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:1182] [Rank 40] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-171-168.ec2.internal<44592> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f511721bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f51183c2fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f51183c327b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f51183c6c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f51183c7839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f51620cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f51671d3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f5166f9e353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 40] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-171-168.ec2.internal<44592> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f511721bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f51183c2fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f51183c327b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f51183c6c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f51183c7839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f51620cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f51671d3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f5166f9e353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f511721bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: 
+ 0xdf6b11 (0x7f511811db11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f51620cbe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f51671d3609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f5166f9e353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:[rank32]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:1182] [Rank 32] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-171-168.ec2.internal<57838> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6aafb8ad87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f6ab0d31fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f6ab0d3227b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f6ab0d35c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f6ab0d36839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f6afaa3ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f6affb42609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f6aff90d353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 32] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-171-168.ec2.internal<57838> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6aafb8ad87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f6ab0d31fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f6ab0d3227b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f6ab0d35c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f6ab0d36839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f6afaa3ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f6affb42609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f6aff90d353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6aafb8ad87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: 
+ 0xdf6b11 (0x7f6ab0a8cb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f6afaa3ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f6affb42609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f6aff90d353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default4]:07/06/2024 09:19:53 [WARNING|DP=0|PP=30|TP=0|ip-26-0-171-230]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:19:53 [WARNING|DP=1|PP=30|TP=0|ip-26-0-171-230]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:19:53 [WARNING|DP=0|PP=31|TP=0|ip-26-0-171-230]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:19:53 [WARNING|DP=1|PP=31|TP=0|ip-26-0-171-230]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:19:53 [WARNING|DP=0|PP=29|TP=0|ip-26-0-171-230]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/06/2024 09:19:53 [WARNING|DP=1|PP=28|TP=0|ip-26-0-171-230]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:19:53 [WARNING|DP=1|PP=29|TP=0|ip-26-0-171-230]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. 
[default6]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:19:53 [WARNING|DP=0|PP=19|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:19:53 [WARNING|DP=1|PP=19|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:19:53 [WARNING|DP=0|PP=18|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:19:53 [WARNING|DP=0|PP=17|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:19:53 [WARNING|DP=1|PP=17|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/06/2024 09:19:53 [WARNING|DP=1|PP=16|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:19:53 [WARNING|DP=1|PP=18|TP=0|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. 
[default0]:07/06/2024 09:19:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [Training Plan] There are 1 training stages [default0]:07/06/2024 09:19:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [Stage Training Stage] start from step 1 [default0]:07/06/2024 09:19:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [default0]:07/06/2024 09:19:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: [Start training] datetime: 2024-07-06 09:19:54.507438 | mbs: 8 | grad_accum: 64 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/06/2024 09:19:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/06/2024 09:19:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-168-120]: Memory usage: 1106.31MiB. Peak allocated 1106.31MiB. Peak reserved: 1126.00MiB [2024-07-06 09:19:56,471] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3348268 closing signal SIGTERM [2024-07-06 09:19:56,471] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3348269 closing signal SIGTERM [2024-07-06 09:19:56,472] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3348270 closing signal SIGTERM [2024-07-06 09:19:56,473] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3348271 closing signal SIGTERM [2024-07-06 09:19:56,472] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3228375 closing signal SIGTERM [2024-07-06 09:19:56,472] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3228376 closing signal SIGTERM [2024-07-06 09:19:56,472] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3228377 closing signal SIGTERM [2024-07-06 09:19:56,474] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3348272 closing signal SIGTERM [2024-07-06 09:19:56,474] 
torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3348273 closing signal SIGTERM [2024-07-06 09:19:56,473] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3228378 closing signal SIGTERM [2024-07-06 09:19:56,473] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3228379 closing signal SIGTERM [2024-07-06 09:19:56,475] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3348274 closing signal SIGTERM [2024-07-06 09:19:56,474] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3228380 closing signal SIGTERM [2024-07-06 09:19:56,475] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3228381 closing signal SIGTERM [2024-07-06 09:19:56,476] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 51098 closing signal SIGTERM [2024-07-06 09:19:56,477] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 51099 closing signal SIGTERM [2024-07-06 09:19:56,477] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 51100 closing signal SIGTERM [2024-07-06 09:19:56,478] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 51101 closing signal SIGTERM [2024-07-06 09:19:56,479] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 51102 closing signal SIGTERM [2024-07-06 09:19:56,479] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 51103 closing signal SIGTERM [2024-07-06 09:19:56,480] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 51104 closing signal SIGTERM [2024-07-06 09:19:58,393] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 3228374) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:19:56 host : ip-26-0-171-230.ec2.internal rank : 56 (local_rank: 0) exitcode : -6 (pid: 3228374) error_file: traceback : Signal 6 (SIGABRT) received by PID 3228374 ============================================================ [default0]:[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. 
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. [default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-169-207.ec2.internal<47448> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2dfc234d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f2dfd3dbfa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f2dfd3dc27b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f2dfd3dfc1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2dfd3e0839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f2e470e4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f2e4c1ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f2e4bfb7353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an 
instance of 'c10::DistBackendError' [default0]: what(): [Rank 0] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. [default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-169-207.ec2.internal<47448> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2dfc234d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f2dfd3dbfa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f2dfd3dc27b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f2dfd3dfc1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2dfd3e0839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f2e470e4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f2e4c1ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f2e4bfb7353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from 
ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2dfc234d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xdf6b11 (0x7f2dfd136b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f2e470e4e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f2e4c1ec609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f2e4bfb7353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: srun: error: ip-26-0-171-230: task 7: Exited with exit code 1 [2024-07-06 09:19:59,089] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 3348267) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:19:56 host : ip-26-0-169-207.ec2.internal rank : 32 (local_rank: 0) exitcode : -6 (pid: 3348267) error_file: traceback : Signal 6 (SIGABRT) received by PID 3348267 ============================================================ [2024-07-06 09:19:59,137] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 51097) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper 
return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:19:56 host : ip-26-0-169-86.ec2.internal rank : 40 (local_rank: 0) exitcode : -6 (pid: 51097) error_file: traceback : Signal 6 (SIGABRT) received by PID 51097 ============================================================ [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, 
in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '8:9', but store->get('8:9') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f313d9cfd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7f317598918e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f31759839a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3175983ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f3175984b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3175939f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3175939f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3175939f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3175939f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f313eb77c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f313eb7ec5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f313eba1b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7f317592c439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7f3175937330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #14: + 0x58433c5 (0x7f31759373c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f3174f7d3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f3171afca88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7f317593da84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default3]:frame #18: + 0x584ed35 (0x7f3175942d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #19: + 0xc97eee (0x7f31881f4eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f3187970ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #21: + 0x1445a6 (0x560dea8045a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x560dea7fda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x560dea810866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560dea7f9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x560dea804a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x560dea810f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560dea7f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x560dea804a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560dea7f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560dea7f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 
(0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560dea7f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560dea7f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560dea7fcf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x560dea80ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x560dea8d1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x560dea7fda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560dea7f93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x560dea804a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560dea7f4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x560dea804a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560dea7f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x560dea810f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560dea7f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x560dea810f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560dea7f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x560dea804a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560dea7fd007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x560dea80ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x560dea8d1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x560dea811067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560dea7f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560dea7f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x560dea810f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560dea7f72b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x560dea810582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x560dea810f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:Traceback (most recent call last): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: trainer.train(dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: outputs = self.pipeline_engine.train_batch_iter( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: sharded_logits = 
self.model( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]: pipeline_state.run_communication() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: recv_activation_tensor = recv_activation() [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: dist.recv( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: return func(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:torch.distributed.DistBackendError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '7:8', but store->get('7:8') got error: Connection reset by peer [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f917ca78d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0x589518e (0x7f91b4a3218e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f91b4a2c9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f91b4a2cce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f91b4a2db11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91b49e2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91b49e2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91b49e2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91b49e2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f917dc20c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f917dc27c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f917dc4ab60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #12: + 0x5838439 (0x7f91b49d5439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7f91b49e0330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #14: + 0x58433c5 (0x7f91b49e03c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #15: + 0x4e893cc (0x7f91b40263cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #16: + 0x1a08a88 (0x7f91b0ba5a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #17: + 0x5849a84 (0x7f91b49e6a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #18: + 0x584ed35 (0x7f91b49ebd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #19: + 0xc97eee (0x7f91c729deee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #20: + 0x413ea4 (0x7f91c6a19ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #21: + 0x1445a6 (0x562fa10f55a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x562fa10eea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #23: + 0x150866 (0x562fa1101866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 
(0x562fa10ea142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x562fa10f5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x562fa1101f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x562fa10e82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x562fa10f5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x562fa10e68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x562fa10e68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #32: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:Traceback (most recent call last): [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x562fa10e68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: 
File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x562fa10e68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x562fa10edf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: outputs = self.pipeline_engine.train_batch_iter( [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x562fa10ffc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default0]:frame #38: + 0x211239 (0x562fa11c2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x562fa10eea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x562fa10ea3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x562fa10f5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x562fa10e5c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter 
[default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x562fa10f5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x562fa10e68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:frame #45: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x562fa1101f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x562fa10e82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: output = model(**micro_batch) [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #48: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #49: PyObject_Call + 0xbc (0x562fa1101f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x562fa10e82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: return self._call_impl(*args, **kwargs) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x562fa10f5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x562fa10ee007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: return forward_call(*args, **kwargs) [default2]: return forward_call(*args, **kwargs) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x562fa10ffc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]:frame #54: + 0x211239 (0x562fa11c2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: sharded_logits = self.model( [default0]:frame #55: PyObject_Call + 0x207 (0x562fa1102067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: sharded_logits = self.model( [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x562fa10e82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl 
[default2]: return self._call_impl(*args, **kwargs) [default0]:frame #57: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x562fa10e68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:frame #59: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x562fa1101f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:Traceback (most recent call last): [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x562fa10e82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x562fa1101582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:frame #63: PyObject_Call + 0xbc (0x562fa1101f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl 
[default2]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: return forward_call(*args, **kwargs) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: trainer.train(dataloader) [default5]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: pipeline_state.run_communication() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: outputs = self.pipeline_engine.train_batch_iter( [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]: output = model(**micro_batch) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]: 
recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return forward_call(*args, **kwargs) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: output = model(**micro_batch) [default5]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: dist.recv( [default2]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states 
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '8:9', but store->get('8:9') got error: Connection reset by peer [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:Exception raised 
from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1524464d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: pipeline_state.run_communication() [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: recv_activation_tensor = recv_activation() [default2]:frame #1: + 0x589518e (0x7f155c41e18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f155c4189a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f155c418ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]: dist.recv( [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f155c419b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f155c3cef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f155c3cef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: pg.recv([tensor], group_src_rank, tag).wait() 
[default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:torch.distributed.DistBackendError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '7:8', but store->get('7:8') got error: Connection reset by peer [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fea4b8b9d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f155c3cef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #1: + 0x589518e (0x7fea8387318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f155c3cef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f152560cc69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fea8386d9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fea8386dce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: pipeline_state.run_communication() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fea8386eb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: recv_activation_tensor = recv_activation() [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea83823f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f1525613c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: return 
self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f1525636b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: dist.recv( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: return func(*args, **kwargs) [default2]:frame #12: + 0x5838439 (0x7f155c3c1439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:frame #13: + 0x5843330 (0x7f155c3cc330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f155c3cc3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]:frame #15: + 0x4e893cc (0x7f155ba123cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #16: + 0x1a08a88 (0x7f1558591a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:torch.distributed.DistBackendError: [14] is setting up NCCL 
communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default2]:frame #17: + 0x5849a84 (0x7f155c3d2a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea83823f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6cabb26d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea83823f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea83823f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #1: + 0x589518e (0x7f6ce3ae018e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #18: + 0x584ed35 (0x7f155c3d7d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:frame #19: + 0xc97eee (0x7f156ec89eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #20: + 0x413ea4 (0x7f156e405ea4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f6ce3ada9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #21: + 0x1445a6 (0x5612f36a15a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: dist.recv( [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5612f369aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6ce3adace2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fea4ca61c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6ce3adbb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fea4ca68c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fea4ca8bb60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but store->get('12:13') got error: Connection reset by peer [default2]:frame #23: + 0x150866 (0x5612f36ad866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5612f3696142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ce3a90f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #12: + 0x5838439 (0x7fea83816439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #13: + 0x5843330 (0x7fea83821330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ce3a90f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #14: + 0x58433c5 (0x7fea838213c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #15: + 0x4e893cc (0x7fea82e673cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f67bc7cdd87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #16: + 0x1a08a88 (0x7fea7f9e6a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7fea83827a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ce3a90f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5612f36a1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ce3a90f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #1: + 0x589518e (0x7f67f478718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #18: + 0x584ed35 (0x7fea8382cd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #19: + 0xc97eee (0x7fea960deeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6cacccec69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f67f47819a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #26: PyObject_Call + 0xbc (0x5612f36adf1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5612f36942b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f6caccd5c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f67f4781ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f67f4782b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #20: + 0x413ea4 (0x7fea9585aea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f67f4737f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f67f4737f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #21: + 0x1445a6 (0x55e963cf85a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55e963cf1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f6caccf8b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #23: + 0x150866 (0x55e963d04866 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #12: + 0x5838439 (0x7f6ce3a83439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e963ced142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5612f36a1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7f6ce3a8e330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #14: + 0x58433c5 (0x7f6ce3a8e3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5612f36928fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f67f4737f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f67f4737f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55e963cf8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x55e963d04f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f67bd975c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, 
c10d::OpType, int, bool) + 0x22b (0x7f67bd97cc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e963ceb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x5612f36ad582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f67bd99fb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #15: + 0x4e893cc (0x7f6ce30d43cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #16: + 0x1a08a88 (0x7f6cdfc53a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5612f36928fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x5612f36ad582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #17: + 0x5849a84 (0x7f6ce3a94a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5612f36928fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #12: + 0x5838439 (0x7f67f472a439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #34: + 0x150582 (0x5612f36ad582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5612f36928fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #13: + 0x5843330 (0x7f67f4735330 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f67f47353c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5612f3699f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #15: + 0x4e893cc (0x7f67f3d7b3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5612f36abc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x5612f376e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #16: + 0x1a08a88 (0x7f67f08faa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #17: + 0x5849a84 (0x7f67f473ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5612f369aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #18: + 0x584ed35 (0x7f67f4740d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55e963cf8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e963ce98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #18: + 0x584ed35 (0x7f6ce3a99d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #30: + 0x150582 (0x55e963d04582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #19: + 0xc97eee (0x7f6806ff2eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #19: + 0xc97eee (0x7f6cf634beee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #20: + 0x413ea4 (0x7f6cf5ac7ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #20: + 0x413ea4 (0x7f680676eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e963ce98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5612f36963e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5612f36a1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #21: + 0x1445a6 (0x5571756185a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #32: + 0x150582 (0x55e963d04582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e963ce98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #34: + 0x150582 (0x55e963d04582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e963ce98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x557175611a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5612f3691c5c 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5612f36a1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x557175624866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e963cf0f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55e963d02c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #21: + 0x1445a6 (0x55d02f79f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5612f36928fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55d02f798a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55717560d142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x5612f36ad582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x55d02f7ab866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x557175618a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x55e963dc5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55e963cf1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e963ced3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #41: _PyFunction_Vectorcall + 0x6c 
(0x55e963cf8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d02f794142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e963ce8c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55e963cf8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55d02f79fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x5612f36adf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x55d02f7abf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d02f7922b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5612f36942b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55d02f79fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x5612f36ad582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d02f7908fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x557175624f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x5612f36adf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5612f36942b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55717560b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5612f36a1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e963ce98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x55e963d04582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d02f7908fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d02f7908fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x557175618a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5571756098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #34: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5571756098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5571756098fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5571756098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d02f7908fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5612f369a007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5612f36abc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x5612f376e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:Traceback (most recent call last): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: trainer.train(dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d02f797f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55d02f7a9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x5612f36ae067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5612f36942b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x557175610f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x5612f36ad582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5612f36928fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x557175622c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x5571756e5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x5612f36ad582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #46: PyObject_Call + 0xbc (0x55e963d04f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e963ceb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x55e963d04582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x557175611a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x55d02f86c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x55e963d04f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e963ceb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55d02f798a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55e963cf8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e963cf1007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 
(0x55717560d3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x5612f36adf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5612f36942b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x5612f36ad582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x5612f36adf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x557175618a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55e963d02c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d02f7943e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55d02f79fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x211239 (0x55e963dc5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:Traceback (most recent call last): [default1]:frame #55: PyObject_Call + 0x207 (0x55e963d05067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e963ceb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x55e963d04582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e963ce98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #59: + 0x150582 (0x55e963d04582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x55e963d04f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e963ceb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x55e963d04582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: PyObject_Call + 0xbc (0x55e963d04f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: outputs = self.pipeline_engine.train_batch_iter( [default1]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl 
[default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: sharded_logits = self.model( [default1]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]: return self._call_impl(*args, **kwargs) [default1]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: return forward_call(*args, **kwargs) [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", 
line 780, in forward_with_hidden_states [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f80d6985d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0x589518e (0x7f810e93f18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f810e9399a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f810e939ce2 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]: pipeline_state.run_communication() [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f810e93ab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f810e8eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f810e8eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f810e8eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f810e8eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: recv_activation_tensor = recv_activation() [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f80d7b2dc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: return self.p2p.recv_tensors(num_tensors=1, 
from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f80d7b34c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f80d7b57b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #12: + 0x5838439 (0x7f810e8e2439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #13: + 0x5843330 (0x7f810e8ed330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #14: + 0x58433c5 (0x7f810e8ed3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #15: + 0x4e893cc (0x7f810df333cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #16: + 0x1a08a88 (0x7f810aab2a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7f810e8f3a84 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: dist.recv( [default1]:frame #18: + 0x584ed35 (0x7f810e8f8d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]:frame #19: + 0xc97eee (0x7f81211aaeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #20: + 0x413ea4 (0x7f8120926ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #21: + 0x1445a6 (0x556d49ed45a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x556d49ecda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #23: + 0x150866 (0x556d49ee0866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x556d49ec9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x556d49ed4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return func(*args, **kwargs) [default1]:frame #26: PyObject_Call + 0xbc (0x556d49ee0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x556d49ec72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:frame #28: 
_PyFunction_Vectorcall + 0x6c (0x556d49ed4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f776e9fdd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0x589518e (0x7f77a69b718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x556d49ec58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #30: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f77a69b19a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x556d49ec58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f77a69b1ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f77a69b2b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #32: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #5: 
c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f77a6967f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x556d49ec58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f77a6967f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f77a6967f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #34: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x556d49ec58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f77a6967f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x556d49eccf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x556d49edec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x556d49fa1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d02f78fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55d02f79fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x556d49ecda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f776fba5c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d02f7908fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f776fbacc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x556d49ec93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f776fbcfb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #46: PyObject_Call + 0xbc (0x55d02f7abf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d02f7922b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x557175608c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x556d49ed4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x55d02f7abf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 
(0x55d02f7922b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x557175618a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #12: + 0x5838439 (0x7f77a695a439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55d02f79fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x556d49ec4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5571756098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #13: + 0x5843330 (0x7f77a6965330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #45: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x557175624f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x556d49ed4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x556d49ec58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d02f798007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #14: + 0x58433c5 (0x7f77a69653c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 
(0x55717560b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #46: PyObject_Call + 0xbc (0x556d49ee0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55d02f7a9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #15: + 0x4e893cc (0x7f77a5fab3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #16: + 0x1a08a88 (0x7f77a2b2aa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #48: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #17: + 0x5849a84 (0x7f77a696ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #54: + 0x211239 (0x55d02f86c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x556d49ec72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x556d49ee0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x557175624f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55717560b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x556d49ec72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x556d49ed4a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x557175618a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x556d49ecd007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x556d49edec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x211239 (0x556d49fa1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #55: PyObject_Call + 0x207 (0x556d49ee1067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x556d49ec72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x557175611007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x557175622c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x5571756e5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x55d02f7ac067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x556d49ec58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x557175625067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #59: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x556d49ee0f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x556d49ec72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d02f7922b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x556d49ee0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #18: + 0x584ed35 (0x7f77a6970d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55717560b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: PyObject_Call + 0xbc (0x556d49ee0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5571756098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #19: + 0xc97eee (0x7f77b9222eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #20: + 0x413ea4 (0x7f77b899eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #21: + 0x1445a6 (0x55bb0803a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x557175624f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55717560b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55bb08033a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #23: + 0x150866 (0x55bb08046866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bb0802f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x557175624582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x557175624f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55bb0803aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x55bb08046f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb0802d2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55bb0803aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bb0802b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bb0802b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #32: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bb0802b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bb0802b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bb08032f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d02f7908fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55bb08044c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x55bb08107239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55bb08033a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bb0802f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x55d02f7abf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d02f7922b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x55d02f7ab582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55bb0803aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bb0802ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55bb0803aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bb0802b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x55d02f7abf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default0]:frame #45: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x55bb08046f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb0802d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #48: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:Traceback (most recent call last): [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]:frame #49: PyObject_Call + 0xbc (0x55bb08046f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb0802d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55bb0803aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bb08033007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55bb08044c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:Traceback (most recent call last): [default0]:frame #54: + 0x211239 (0x55bb08107239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #55: PyObject_Call + 0x207 (0x55bb08047067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb0802d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:frame #57: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bb0802b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:Traceback (most recent call last): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:frame #59: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x55bb08046f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb0802d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x55bb08046582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: trainer.train(dataloader) [default0]:frame #63: PyObject_Call + 0xbc (0x55bb08046f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: outputs = self.pipeline_engine.train_batch_iter( [default6]: trainer.train(dataloader) [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: output = model(**micro_batch) [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in 
_wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default4]: return self._call_impl(*args, **kwargs) [default0]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: outputs = self.pipeline_engine.train_batch_iter( [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: return self._call_impl(*args, **kwargs) [default1]: return forward_call(*args, **kwargs) [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: output = 
model(**micro_batch) [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: return self._call_impl(*args, **kwargs) [default3]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: dist.recv( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return forward_call(*args, **kwargs) [default6]: return 
self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default7]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:Traceback (most recent call last): [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: return func(*args, **kwargs) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:torch.distributed.DistBackendError: [10] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '9:10', but store->get('9:10') got error: Connection reset by peer [default0]: sharded_logits = self.model( [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: outputs = self.pipeline_engine.train_batch_iter( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default6]: return forward_call(*args, **kwargs) [default3]: sharded_logits = self.model( [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2efc80dd87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0x589518e (0x7f2f347c718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2f347c19a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2f347c1ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2f347c2b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: pipeline_state.run_communication() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f34777f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f34777f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: return forward_call(*args, **kwargs) [default6]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ 
[default1]: return self._call_impl(*args, **kwargs) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f34777f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f34777f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2efd9b5c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2efd9bcc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: outputs = 
self.pipeline_engine.train_batch_iter( [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2efd9dfb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: return self._call_impl(*args, **kwargs) [default1]: return forward_call(*args, **kwargs) [default4]:frame #12: + 0x5838439 (0x7f2f3476a439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #13: + 0x5843330 (0x7f2f34775330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7f2f347753c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #15: + 0x4e893cc (0x7f2f33dbb3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7f2f3093aa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:frame #17: + 0x5849a84 (0x7f2f3477ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:torch.distributed.DistBackendError: [11] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '10:11', but store->get('10:11') got error: Connection reset by peer [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff78995fd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:frame #18: + 0x584ed35 (0x7f2f34780d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]:frame #1: + 0x589518e (0x7ff7c191918e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 
0x360 (0x7ff7c19139a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return forward_call(*args, **kwargs) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff7c1913ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff7c1914b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff7c18c9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff7c18c9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:frame #19: + 0xc97eee (0x7f2f47032eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:frame #20: + 0x413ea4 (0x7f2f467aeea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #21: + 0x1445a6 (0x55f2e823b5a6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return self._call_impl(*args, **kwargs) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff7c18c9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55f2e8234a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff7c18c9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff78ab07c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:Traceback (most recent call last): [default7]: trainer.train(dataloader) [default4]: output = model(**micro_batch) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7ff78ab0ec5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in 
_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: output = model(**micro_batch) [default5]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: return self._call_impl(*args, **kwargs) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: pipeline_state.run_communication() [default1]: pipeline_state.run_communication() [default3]: return forward_call(*args, **kwargs) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]:frame #23: + 0x150866 (0x55f2e8247866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f2e8230142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55f2e823ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: outputs = self.pipeline_engine.train_batch_iter( [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7ff78ab31b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:frame #26: PyObject_Call + 0xbc (0x55f2e8247f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2e822e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: recv_activation_tensor = recv_activation() [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55f2e823ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: output = self.forward(context=context, 
state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: outputs = self.pipeline_engine.train_batch_iter( [default4]: return forward_call(*args, **kwargs) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f2e822c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f2e822c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: sharded_logits = self.model( [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:frame #32: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: recv_activation_tensor = recv_activation() [default4]:Traceback (most recent call last): [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f2e822c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7ff7c18bc439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #34: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: output = model(**micro_batch) [default7]:Traceback (most recent call last): [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f2e822c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f2e8233f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: trainer.train(dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, 
in train [default7]: return self._call_impl(*args, **kwargs) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55f2e8245c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:frame #38: + 0x211239 (0x55f2e8308239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #13: + 0x5843330 (0x7ff7c18c7330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: return self._call_impl(*args, **kwargs) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: trainer.train(dataloader) [default2]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: return forward_call(*args, **kwargs) [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: dist.recv( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55f2e8234a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f2e82303e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: return self._call_impl(*args, **kwargs) [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in 
recv_tensors [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55f2e823ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f2e822bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #14: + 0x58433c5 (0x7ff7c18c73c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: outputs = self.pipeline_engine.train_batch_iter( [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:frame #15: + 0x4e893cc (0x7ff7c0f0d3cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:frame #16: + 0x1a08a88 (0x7ff7bda8ca88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:frame #17: + 0x5849a84 (0x7ff7c18cda84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default7]: return self._call_impl(*args, **kwargs) [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: sharded_logits = self.model( [default2]: output = model(**micro_batch) [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55f2e823ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f2e822c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got error: Connection reset by peer [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #45: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in 
_recv_meta [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: dist.recv( [default3]: dist.recv( [default7]: return forward_call(*args, **kwargs) [default4]:frame #46: PyObject_Call + 0xbc (0x55f2e8247f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fefcf2f0d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2e822e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7ff7c18d2d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #19: + 0xc97eee (0x7ff7d4184eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default6]:frame #20: + 0x413ea4 (0x7ff7d3900ea4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #48: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default0]:frame #1: + 0x589518e (0x7ff0072aa18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7ff0072a49a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: return forward_call(*args, **kwargs) [default7]: return self._call_impl(*args, **kwargs) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff0072a4ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #21: + 0x1445a6 (0x561bb49ad5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x561bb49a6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff0072a5b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #49: PyObject_Call + 0xbc (0x55f2e8247f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff00725af81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return forward_call(*args, **kwargs) [default2]: sharded_logits = self.model( [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2e822e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55f2e823ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = 
encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff00725af81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: return self._call_impl(*args, **kwargs) [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but store->get('12:13') got error: Connection reset by peer [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f2e8234007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55f2e8245c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: return forward_call(*args, **kwargs) [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff00725af81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #23: + 0x150866 (0x561bb49b9866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5b3f06d87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: sharded_logits = self.model( [default1]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got error: Connection reset by peer [default4]:frame #54: + 0x211239 (0x55f2e8308239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x55f2e8248067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2e822e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default2]: return forward_call(*args, **kwargs) [default3]:frame #1: + 0x589518e (0x7fa5ebec018e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb81845bd87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #1: + 0x589518e (0x7fb85041518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff00725af81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fefd0498c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561bb49a2142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x561bb49ada2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: sharded_logits = self.model( [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fa5ebeba9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return forward_call(*args, **kwargs) [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] 
[default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa5ebebace2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fefd049fc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #26: PyObject_Call + 0xbc (0x561bb49b9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fefd04c2b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: new_kwargs[name] = 
recv_from_pipeline_state_buffer( [default4]: return self._call_impl(*args, **kwargs) [default7]: pipeline_state.run_communication() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fb85040f9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: pipeline_state.run_communication() [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f2e822c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561bb49a02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default0]:frame #12: + 0x5838439 (0x7ff00724d439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7ff007258330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #14: + 0x58433c5 (0x7ff0072583c5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #59: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: recv_activation_tensor = recv_activation() [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x561bb49ada2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #60: PyObject_Call + 0xbc (0x55f2e8247f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2e822e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:frame #4: 
c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa5ebebbb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5ebe70f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: pipeline_state.run_communication() [default2]: return forward_call(*args, **kwargs) [default0]:frame #15: + 0x4e893cc (0x7ff00689e3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: recv_activation_tensor = recv_activation() [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb85040fce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #62: + 0x150582 (0x55f2e8247582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5ebe70f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5ebe70f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #63: PyObject_Call + 0xbc (0x55f2e8247f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:frame #16: + 0x1a08a88 (0x7ff00341da88 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:frame #17: + 0x5849a84 (0x7ff00725ea84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: recv_activation_tensor = recv_activation() [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561bb499e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5ebe70f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: pipeline_state.run_communication() [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb850410b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #30: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb8503c5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #18: + 0x584ed35 (0x7ff007263d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561bb499e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa5b50aec69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb8503c5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #32: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561bb499e8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default2]: recv_activation_tensor = recv_activation() [default4]: return self._call_impl(*args, **kwargs) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fa5b50b5c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fa5b50d8b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: dist.recv( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: return func(*args, **kwargs) [default7]: dist.recv( [default6]:frame #34: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561bb499e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561bb49a5f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: dist.recv( [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb8503c5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #12: + 0x5838439 (0x7fa5ebe63439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7fa5ebe6e330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x561bb49b7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: return forward_call(*args, 
**kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:frame #14: + 0x58433c5 (0x7fa5ebe6e3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7fa5eb4b43cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #19: + 0xc97eee (0x7ff019b15eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #38: + 0x211239 (0x561bb4a7a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x561bb49a6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561bb49a23e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default3]:frame #16: + 0x1a08a88 (0x7fa5e8033a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: return func(*args, **kwargs) [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) 
[default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb8503c5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: return func(*args, **kwargs) [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]: recv_activation_tensor = recv_activation() [default7]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection reset by peer [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: dist.recv( [default4]: pipeline_state.run_communication() [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb819603c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fb81960ac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #20: + 0x413ea4 (0x7ff019291ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:torch.distributed.DistBackendError: [11] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value 
store by key '10:11', but store->get('10:11') got error: Connection reset by peer [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:frame #17: + 0x5849a84 (0x7fa5ebe74a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x561bb49ada2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561bb499dc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4e518ced87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fb81962db60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #21: + 0x1445a6 
(0x55bb826605a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55bb82659a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return func(*args, **kwargs) [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:frame #23: + 0x150866 (0x55bb8266c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x561bb49ada2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561bb499e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: recv_activation_tensor = recv_activation() [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bb82655142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x561bb49b9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #12: + 0x5838439 (0x7fb8503b8439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:torch.distributed.DistBackendError: [10] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '9:10', but store->get('9:10') got error: Connection reset by peer [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #13: + 0x5843330 (0x7fb8503c3330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb532eebd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561bb49a02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:frame #18: + 0x584ed35 (0x7fa5ebe79d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #48: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return 
self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: dist.recv( [default7]:frame #1: + 0x589518e (0x7f4e8988818e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55bb82660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f11b5101d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:frame #26: PyObject_Call + 0xbc (0x55bb8266cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #19: + 0xc97eee (0x7fa5fe72beee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #1: + 0x589518e (0x7fb56aea518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb826532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #1: + 0x589518e (0x7f11ed0bb18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #14: + 0x58433c5 (0x7fb8503c33c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fb56ae9f9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f4e898829a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f11ed0b59a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: dist.recv( [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4e89882ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55bb82660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb56ae9fce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb56aea0b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return func(*args, **kwargs) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bb826518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f11ed0b5ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:frame #30: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4e89883b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f11ed0b6b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #49: PyObject_Call + 0xbc (0x561bb49b9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561bb49a02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:frame #15: + 0x4e893cc (0x7fb84fa093cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc98ba3dd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb56ae55f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f11ed06bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: return func(*args, **kwargs) [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]:frame #1: + 0x589518e (0x7fc9c39f718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e89838f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e89838f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x561bb49ada2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bb826518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #16: + 0x1a08a88 (0x7fb84c588a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561bb49a6007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb56ae55f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb56ae55f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x561bb49b7c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:frame #54: + 0x211239 (0x561bb4a7a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f11ed06bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb56ae55f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]:frame #55: PyObject_Call + 0x207 (0x561bb49ba067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561bb49a02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default4]: return func(*args, **kwargs) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb534093c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", 
line 1706, in recv [default2]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fb53409ac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #57: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc9c39f19a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc9c39f1ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561bb499e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc9c39f2b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #20: + 0x413ea4 (0x7fa5fdea7ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e89838f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 
(0x7fb5340bdb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e89838f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #59: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x561bb49b9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561bb49a02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb5a8e66d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #32: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #17: + 0x5849a84 (0x7fb8503c9a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #62: + 0x150582 (0x561bb49b9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x561bb49b9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4e52a76c69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f11ed06bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f11ed06bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8dc67a4d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bb826518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #12: + 0x5838439 (0x7fb56ae48439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #1: + 0x589518e (0x7fb5e0e2018e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #34: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #13: + 0x5843330 (0x7fb56ae53330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2ce11b6d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #18: + 0x584ed35 (0x7fb8503ced35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f4e52a7dc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f4e52aa0b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #14: + 0x58433c5 (0x7fb56ae533c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #1: + 0x589518e (0x7f8dfe75e18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc9c39a7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc9c39a7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc9c39a7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7fb56a4993cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 
0x360 (0x7fb5e0e1a9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #12: + 0x5838439 (0x7f4e8982b439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #21: + 0x1445a6 (0x55c77a3d15a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55c77a3caa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #16: + 0x1a08a88 (0x7fb567018a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #1: + 0x589518e (0x7f2d1917018e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #23: + 0x150866 (0x55c77a3dd866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #17: + 0x5849a84 (0x7fb56ae59a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f11b62a9c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb5e0e1ace2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7f4e89836330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc9c39a7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #35: 
_PyEval_EvalFrameDefault + 0x13ca (0x55bb826518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f11b62b0c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f11b62d3b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f8dfe7589a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2d1916a9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bb82658f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #12: + 0x5838439 (0x7f11ed05e439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8dfe758ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c77a3c6142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55c77a3d1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7f11ed069330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #4: 
c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8dfe759b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb5e0e1bb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc98cbe5c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #14: + 0x58433c5 (0x7f11ed0693c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #15: + 0x4e893cc (0x7f11ec6af3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5e0dd0f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55bb8266ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x55bb8272d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #16: + 0x1a08a88 (0x7f11e922ea88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #17: + 0x5849a84 (0x7f11ed06fa84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8dfe70ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 
(0x7fb5e0dd0f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7f4e898363c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #26: PyObject_Call + 0xbc (0x55c77a3ddf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c77a3c42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc98cbecc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #18: + 0x584ed35 (0x7fb56ae5ed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7fb57d710eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2d1916ace2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8dfe70ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8dfe70ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5e0dd0f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #19: + 0xc97eee (0x7fb862c80eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7fb57ce8cea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5e0dd0f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55c77a3d1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc98cc0fb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55bb82659a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bb826553e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #18: + 0x584ed35 (0x7f11ed074d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2d1916bb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #12: + 0x5838439 (0x7fc9c399a439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7f4e88e7c3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #21: + 0x1445a6 (0x562e7fd335a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, 
std::string const&, int) + 0xa9 (0x7fb5aa00ec69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8dfe70ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #16: + 0x1a08a88 (0x7f4e859fba88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x562e7fd2ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #19: + 0xc97eee (0x7f11ff926eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d19120f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55bb82660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x562e7fd3f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8dc794cc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fb5aa015c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #17: + 0x5849a84 (0x7f4e8983ca84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #20: + 0x413ea4 (0x7f11ff0a2ea4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f8dc7953c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #13: + 0x5843330 (0x7fc9c39a5330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bb82650c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x562e7fd28142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x562e7fd33a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fb5aa038b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d19120f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f8dc7976b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #21: + 0x1445a6 (0x55760b9995a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #12: + 0x5838439 (0x7fb5e0dc3439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #26: PyObject_Call + 0xbc (0x562e7fd3ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default2]:frame #12: + 0x5838439 (0x7f8dfe701439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55760b992a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #13: + 0x5843330 (0x7fb5e0dce330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x562e7fd262b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #13: + 0x5843330 (0x7f8dfe70c330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x562e7fd33a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d19120f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7fb5e0dce3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7fc9c39a53c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x562e7fd248fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x562e7fd248fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d19120f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55bb82660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bb826518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x55760b9a5866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #14: + 0x58433c5 (0x7f8dfe70c3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7fb5e04143cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2ce235ec69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #20: + 0x413ea4 (0x7fb8623fcea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55760b98e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55760b999a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #16: + 0x1a08a88 (0x7fb5dcf93a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #45: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x55760b9a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #15: + 0x4e893cc (0x7f8dfdd523cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #46: PyObject_Call + 0xbc (0x55bb8266cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2ce2365c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #17: + 0x5849a84 (0x7fb5e0dd4a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #18: + 0x584ed35 (0x7f4e89841d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c77a3c28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55760b98c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #16: + 0x1a08a88 (0x7f8dfa8d1a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7f4e9c0f3eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7f4e9b86fea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x562e7fd248fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: 
+ 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #17: + 0x5849a84 (0x7f8dfe712a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2ce2388b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c77a3c28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x562e7fd248fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #12: + 0x5838439 (0x7f2d19113439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #15: + 0x4e893cc (0x7fc9c2feb3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #21: + 0x1445a6 (0x5610c4e0f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5610c4e08a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x562e7fd2bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #18: + 0x584ed35 (0x7fb5e0dd9d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7fc9bfb6aa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x562e7fd3dc39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x562e7fe00239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #19: + 0xc97eee (0x7fb5f368beee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #21: + 0x1445a6 (0x5608a11475a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x5610c4e1b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5610c4e04142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55760b999a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #18: + 0x584ed35 (0x7f8dfe717d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb826532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55760b98a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x562e7fd2ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #20: + 0x413ea4 (0x7fb5f2e07ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #48: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #19: + 0xc97eee (0x7f8e10fc9eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #13: + 0x5843330 (0x7f2d1911e330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #49: PyObject_Call + 0xbc (0x55bb8266cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5610c4e0fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55760b98a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #20: + 0x413ea4 (0x7f8e10745ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #26: PyObject_Call + 0xbc (0x5610c4e1bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x562e7fd283e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #21: + 0x1445a6 (0x55bc4519d5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb826532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55760b98a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #14: + 0x58433c5 (0x7f2d1911e3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5610c4e022b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 
0x6c (0x562e7fd33a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x562e7fd23c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x562e7fd33a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #21: + 0x1445a6 (0x5615fc3d75a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #17: + 0x5849a84 (0x7fc9c39aba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #34: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55bc45196a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c77a3c28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x562e7fd248fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #15: + 0x4e893cc (0x7f2d187643cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5610c4e0fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5610c4e008fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x562e7fd3ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x55bc451a9866 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5615fc3d0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5608a1140a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x562e7fd262b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x562e7fd3ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bc45192142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x562e7fd262b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x562e7fd33a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x562e7fd2c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x5615fc3e3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x562e7fd3dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #16: + 0x1a08a88 (0x7f2d152e3a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #25: 
_PyFunction_Vectorcall + 0x6c (0x55bc4519da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55760b98a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #17: + 0x5849a84 (0x7f2d19124a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55760b991f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x55bc451a9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55760b9a3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5615fc3cc142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x562e7fe00239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x562e7fd40067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bc451902b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x562e7fd262b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55bc4519da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x562e7fd248fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bc4518e8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c77a3c28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #18: + 0x584ed35 (0x7fc9c39b0d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55bb82660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5615fc3d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #19: + 0xc97eee (0x7fc9d6262eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c77a3c9f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #18: + 0x584ed35 (0x7f2d19129d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bb82659007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x5615fc3e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #20: + 0x413ea4 (0x7fc9d59deea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55c77a3dbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bc4518e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #53: 
_PyObject_Call_Prepend + 0x69 (0x55bb8266ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5615fc3ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #19: + 0xc97eee (0x7f2d2b9dbeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #38: + 0x211239 (0x55c77a49e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bc4518e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55c77a3caa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5610c4e008fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #20: + 0x413ea4 (0x7f2d2b157ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #32: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #23: + 0x150866 (0x5608a1153866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x55760ba66239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5615fc3d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5608a113c142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #54: + 0x211239 (0x55bb8272d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x562e7fd3ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x562e7fd262b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x562e7fd3f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x562e7fd3ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55760b992a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bc4518e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #21: + 0x1445a6 (0x5560e6f605a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5615fc3c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c77a3c63e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5610c4e008fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55760b98e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55760b999a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 
0xd0 (0x55bc45195f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #55: PyObject_Call + 0x207 (0x55bb8266d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb826532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55760b989c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #21: + 0x1445a6 (0x55d7629c05a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55d7629b9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55760b999a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5560e6f59a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5610c4e008fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55760b98a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55bc451a7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x55bc4526a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x55d7629cc866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55c77a3d1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c77a3c1c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x5560e6f6c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5615fc3c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55c77a3d1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5608a1147a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x5608a1153f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x55760b9a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55760b98c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5560e6f55142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5608a113a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x55760b9a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55760b98c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55760b999a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d7629b5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55d7629c0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55760b992007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55760b9a3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 
0x211239 (0x55760ba66239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x55760b9a6067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5560e6f60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5608a1147a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bb826518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55760b98c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55760b98a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5615fc3c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x55760b9a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55760b98c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x55760b9a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x5560e6f6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55760b9a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:frame #34: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5560e6f532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5560e6f60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55bc45196a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bc451923e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c77a3c28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55bc4519da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5608a11388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55d7629ccf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bc4518dc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55bc4519da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5610c4e07f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5610c4e19c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5615fc3c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x55c77a3ddf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x5610c4edc239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bc4518e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5560e6f518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #59: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x55bb8266cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d7629b32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55d7629c0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d7629b18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x55bc451a9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5610c4e08a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 
(0x5610c4e043e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5615fc3cff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5610c4e0fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5615fc3e1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c77a3c42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5560e6f518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bc451902b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #30: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5608a11388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5560e6f518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #32: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d7629b18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 
0x211239 (0x5615fc4a4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb826532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5608a11388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5615fc3d0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #34: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x55c77a3ddf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5610c4dffc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5610c4e0fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c77a3c42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55c77a3d1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5615fc3cc3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x55bc451a9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: 
_PyEval_EvalFrameDefault + 0x13ca (0x5560e6f518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5615fc3d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5560e6f58f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bc451902b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5615fc3c7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5560e6f6ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55bc4519da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d7629b18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5615fc3d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5615fc3c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5610c4e008fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bc45196007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5608a11388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 
0x13ca (0x55d7629b18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55bc451a7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c77a3ca007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55c77a3dbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x55bc4526a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x55c77a49e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x5560e702d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x5615fc3e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5608a113ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5608a1151c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5560e6f59a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x5610c4e1bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5615fc3ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 
(0x5610c4e022b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5560e6f553e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x5608a1214239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x55bc451aa067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5608a1140a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x55bb8266c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x5615fc3e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #63: PyObject_Call + 0xbc (0x55bb8266cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d7629b8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55d7629cac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bc451902b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x55c77a3de067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5560e6f60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5615fc3ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: 
_PyEval_EvalFrameDefault + 0x2d83 (0x55c77a3c42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x5610c4e1bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5610c4e022b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bc4518e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5615fc3d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5560e6f50c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x55d762a8d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55d7629b9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5615fc3d0007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d7629b53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5610c4e0fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5560e6f60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5560e6f518fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5610c4e08007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5615fc3e1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x55bc451a9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x5615fc4a4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x5560e6f6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x5615fc3e4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5560e6f532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bc451902b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5608a113c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5615fc3ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x55bc451a9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5610c4e19c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5608a1147a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x55bc451a9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x5560e6f6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c77a3c28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5615fc3c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x5610c4edc239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5560e6f532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: 
_PyFunction_Vectorcall + 0x6c (0x55d7629c0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x5610c4e1c067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:frame #59: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x55c77a3ddf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5560e6f60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5560e6f59007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5610c4e022b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5560e6f6ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5608a1137c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5608a1147a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x5615fc3e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d7629b0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55d7629c0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d7629b18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default4]:frame #54: + 0x211239 (0x5560e702d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5610c4e008fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5615fc3ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5608a11388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x5615fc3e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x5615fc3e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]:frame #60: PyObject_Call + 0xbc (0x5610c4e1bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5610c4e022b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x5560e6f6d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5560e6f532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #46: PyObject_Call + 0xbc (0x5608a1153f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5608a113a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5560e6f518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #60: PyObject_Call + 0xbc (0x5560e6f6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x5610c4e1b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5560e6f532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x5560e6f6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x5560e6f6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: outputs = self.pipeline_engine.train_batch_iter( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]:Traceback (most recent call last): [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c77a3c42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x55c77a3dd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: output = model(**micro_batch) [default7]:frame #63: PyObject_Call + 0xbc (0x5610c4e1bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x5608a1153f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55d7629ccf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: trainer.train(dataloader) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:frame #63: PyObject_Call + 0xbc (0x55c77a3ddf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d7629b32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:frame #48: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5608a113a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5608a1147a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5608a1140007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: outputs = self.pipeline_engine.train_batch_iter( [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5608a1151c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default1]:frame #54: + 0x211239 (0x5608a1214239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]:frame #55: PyObject_Call + 0x207 (0x5608a1154067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5608a113a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:frame #57: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5608a11388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #59: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default1]:frame #60: PyObject_Call + 0xbc (0x5608a1153f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:frame #49: PyObject_Call + 
0xbc (0x55d7629ccf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5608a113a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d7629b32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55d7629c0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: output = model(**micro_batch) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d7629b9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55d7629cac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x55d762a8d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default4]:frame #55: PyObject_Call + 0x207 (0x55d7629cd067 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d7629b32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #57: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d7629b18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]:frame #59: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return forward_call(*args, **kwargs) [default4]:frame #60: PyObject_Call + 0xbc (0x55d7629ccf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x5608a1153582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: sharded_logits = self.model( [default1]:frame #63: PyObject_Call + 0xbc (0x5608a1153f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default5]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: recv_activation_tensor = recv_activation() [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d7629b32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x55d7629cc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x55d7629ccf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: dist.recv( [default6]: recv_activation_tensor = recv_activation() 
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f68a5c17d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0x589518e (0x7f68ddbd118e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f68ddbcb9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: dist.recv( [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f68ddbcbce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f68ddbccb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f68ddb81f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return func(*args, **kwargs) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f68ddb81f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f68ddb81f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f68ddb81f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f68a6dbfc69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f68a6dc6c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb1a1585d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f68a6de9b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #12: + 0x5838439 (0x7f68ddb74439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #13: + 0x5843330 (0x7f68ddb7f330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #1: + 0x589518e (0x7fb1d953f18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #14: + 0x58433c5 (0x7f68ddb7f3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fb1d95399a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame 
#15: + 0x4e893cc (0x7f68dd1c53cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #16: + 0x1a08a88 (0x7f68d9d44a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb1d9539ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #17: + 0x5849a84 (0x7f68ddb85a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb1d953ab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1d94eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1d94eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #18: + 0x584ed35 (0x7f68ddb8ad35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #19: + 0xc97eee (0x7f68f043ceee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #20: + 0x413ea4 (0x7f68efbb8ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1d94eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #21: + 0x1445a6 (0x563ee02175a6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x563ee0210a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x563ee0223866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1d94eff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563ee020c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x563ee0217a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x563ee0223f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563ee020a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x563ee0217a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb1a272dc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563ee02088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fb1a2734c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) 
[default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563ee02088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fb1a2757b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563ee02088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7fb1d94e2439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #13: + 0x5843330 (0x7fb1d94ed330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #34: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563ee02088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #14: + 0x58433c5 (0x7fb1d94ed3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563ee020ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #15: + 0x4e893cc (0x7fb1d8b333cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #16: + 0x1a08a88 (0x7fb1d56b2a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7fb1d94f3a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x563ee0221c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x563ee02e4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x563ee0210a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7fb1d94f8d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563ee020c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7fb1ebdaaeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x563ee0217a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #20: + 0x413ea4 (0x7fb1eb526ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563ee0207c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #21: + 0x1445a6 (0x563a993de5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x563a993d7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x563ee0217a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563ee02088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: 
PyObject_Call + 0xbc (0x563ee0223f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563ee020a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x563a993ea866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563a993d3142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x563ee0223f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563ee020a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x563a993dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x563ee0217a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x563a993eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563ee0210007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563a993d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x563ee0221c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x563ee02e4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x563ee0224067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x563a993dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563ee020a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563ee02088fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x563ee0223f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563ee020a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x563ee0223582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563a993cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x563ee0223f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563a993cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #32: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563a993cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563a993cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563a993d6f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x563a993e8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x563a994ab239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x563a993d7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563a993d33e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x563a993dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563a993cec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x563a993dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563a993cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x563a993eaf1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563a993d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x563a993eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563a993d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x563a993dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563a993d7007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x563a993e8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x563a994ab239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x563a993eb067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563a993d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563a993cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x563a993eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 
(0x563a993d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x563a993ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x563a993eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward 
[default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0edb4b7d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7f0f1347118e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f0f1346b9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0f1346bce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0f1346cb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f13421f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f13421f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f13421f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f13421f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0edc65fc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f0edc666c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f0edc689b60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7f0f13414439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7f0f1341f330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #14: + 0x58433c5 (0x7f0f1341f3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f0f12a653cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f0f0f5e4a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7f0f13425a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #18: + 0x584ed35 (0x7f0f1342ad35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #19: + 0xc97eee (0x7f0f25cdceee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f0f25458ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #21: + 0x1445a6 (0x560966e525a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x560966e4ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x560966e5e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 
(0x560966e47142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x560966e52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x560966e5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560966e452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x560966e52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560966e438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560966e438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560966e438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560966e438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560966e4af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x560966e5cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x560966f1f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: 
_PyObject_MakeTpCall + 0x26b (0x560966e4ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560966e473e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x560966e52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560966e42c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x560966e52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560966e438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x560966e5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560966e452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x560966e5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560966e452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x560966e52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560966e4b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x560966e5cc39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x560966f1f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x560966e5f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560966e452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560966e438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x560966e5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560966e452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x560966e5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x560966e5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return func(*args, **kwargs) [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection reset by peer [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa78e9c9d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0x589518e (0x7fa7c698318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fa7c697d9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa7c697dce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa7c697eb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7c6933f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7c6933f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7c6933f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7c6933f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa78fb71c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fa78fb78c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fa78fb9bb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #12: + 0x5838439 (0x7fa7c6926439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #13: + 0x5843330 (0x7fa7c6931330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #14: + 0x58433c5 (0x7fa7c69313c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7fa7c5f773cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #16: + 0x1a08a88 (0x7fa7c2af6a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7fa7c6937a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default6]:frame #18: + 0x584ed35 (0x7fa7c693cd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #19: + 0xc97eee (0x7fa7d91eeeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #20: + 0x413ea4 (0x7fa7d896aea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #21: + 0x1445a6 (0x5589cdc565a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5589cdc4fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x5589cdc62866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5589cdc4b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5589cdc56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x5589cdc62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5589cdc492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5589cdc56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5589cdc478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5589cdc478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 
(0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5589cdc478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5589cdc478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5589cdc4ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5589cdc60c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x5589cdd23239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5589cdc4fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5589cdc4b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5589cdc56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5589cdc46c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5589cdc56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5589cdc478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x5589cdc62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5589cdc492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x5589cdc62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5589cdc492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5589cdc56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5589cdc4f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5589cdc60c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x5589cdd23239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x5589cdc63067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5589cdc492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5589cdc478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x5589cdc62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5589cdc492b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x5589cdc62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x5589cdc62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. srun: error: ip-26-0-169-207: task 5: Exited with exit code 1 [default1]:07/06/2024 09:19:59 [WARNING|DP=1|PP=0|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:19:59 [WARNING|DP=0|PP=3|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. srun: error: ip-26-0-169-86: task 2: Exited with exit code 1 [default3]:07/06/2024 09:19:59 [WARNING|DP=1|PP=1|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:19:59 [WARNING|DP=1|PP=3|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:19:59 [WARNING|DP=0|PP=1|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:19:59 [WARNING|DP=0|PP=2|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:19:59 [WARNING|DP=1|PP=2|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. 
[default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: outputs = self.pipeline_engine.train_batch_iter( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: outputs = self.pipeline_engine.train_batch_iter( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: output = model(**micro_batch) 
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: output = model(**micro_batch) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: output = model(**micro_batch) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: recv_activation_tensor = recv_activation() [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in 
recv_tensors [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]: return self._call_impl(*args, **kwargs) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2d60450d87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0x589518e (0x7f2d9840a18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2d984049a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: pipeline_state.run_communication() [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2d98404ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2d98405b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d983baf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d983baf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d983baf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d983baf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2d615f8c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #10: 
c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2d615ffc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2d61622b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:frame #12: + 0x5838439 (0x7f2d983ad439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: recv_activation_tensor = recv_activation() [default2]:frame #13: + 0x5843330 (0x7f2d983b8330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f2d983b83c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #15: + 0x4e893cc (0x7f2d979fe3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #16: + 0x1a08a88 (0x7f2d9457da88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #17: + 0x5849a84 (0x7f2d983bea84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #18: + 0x584ed35 (0x7f2d983c3d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self.p2p.recv_tensors(num_tensors=1, 
from_rank=self.from_rank)[0] [default2]:frame #19: + 0xc97eee (0x7f2daac75eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #20: + 0x413ea4 (0x7f2daa3f1ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #21: + 0x1445a6 (0x559956b435a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x559956b3ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:frame #23: + 0x150866 (0x559956b4f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: dist.recv( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559956b38142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x559956b43a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x559956b4ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return func(*args, **kwargs) [default5]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559956b362b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x559956b43a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559956b348fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f95f6b5bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559956b348fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #1: + 0x589518e (0x7f962eb1518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f962eb0f9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #32: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 
(0x7f962eb0fce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f962eb10b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f962eac5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559956b348fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #34: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x559956b348fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559956b3bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x559956b4dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x559956c10239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x559956b3ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559956b383e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f962eac5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x559956b43a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 
0x72c (0x559956b33c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x559956b43a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f962eac5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f962eac5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559956b348fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f95f7d03c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f95f7d0ac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f95f7d2db60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #12: + 0x5838439 (0x7f962eab8439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #45: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7f962eac3330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #46: PyObject_Call + 0xbc (0x559956b4ff1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #14: + 0x58433c5 (0x7f962eac33c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #15: + 0x4e893cc (0x7f962e1093cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #16: + 0x1a08a88 (0x7f962ac88a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #17: + 0x5849a84 (0x7f962eac9a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559956b362b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #18: + 0x584ed35 (0x7f962eaced35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #19: + 0xc97eee (0x7f9641380eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #20: + 0x413ea4 (0x7f9640afcea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #48: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x559956b4ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559956b362b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #21: + 0x1445a6 (0x5571504115a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x559956b43a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559956b3c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55715040aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x559956b4dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x559956c10239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x55715041d866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x559956b50067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x557150406142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x557150411a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559956b362b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559956b348fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x55715041df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5571504042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x559956b4ff1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x557150411a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5571504028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5571504028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559956b362b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5571504028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x559956b4f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x559956b4ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5571504028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x557150409f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55715041bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x5571504de239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55715040aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5571504063e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x557150411a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x557150401c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x557150411a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5571504028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x55715041df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5571504042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x55715041df1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5571504042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x557150411a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55715040a007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55715041bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x5571504de239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x55715041e067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5571504042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5571504028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x55715041df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5571504042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x55715041d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55715041df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() 
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48d5aabd87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7f490da6518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f490da5f9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f490da5fce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f490da60b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f490da15f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f490da15f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f490da15f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f490da15f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f48d6c53c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, 
std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f48d6c5ac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f48d6c7db60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7f490da08439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7f490da13330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #14: + 0x58433c5 (0x7f490da133c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f490d0593cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f4909bd8a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7f490da19a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #18: + 0x584ed35 (0x7f490da1ed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #19: + 0xc97eee (0x7f49202d0eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f491fa4cea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #21: + 0x1445a6 (0x5564ec3935a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: 
_PyObject_MakeTpCall + 0x26b (0x5564ec38ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x5564ec39f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5564ec388142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5564ec393a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x5564ec39ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5564ec3862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5564ec393a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5564ec3848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5564ec3848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5564ec3848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5564ec3848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5564ec38bf50 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5564ec39dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x5564ec460239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5564ec38ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5564ec3883e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5564ec393a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5564ec383c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5564ec393a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5564ec3848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x5564ec39ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5564ec3862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x5564ec39ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5564ec3862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: 
_PyFunction_Vectorcall + 0x6c (0x5564ec393a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5564ec38c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5564ec39dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x5564ec460239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x5564ec3a0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5564ec3862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5564ec3848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x5564ec39ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5564ec3862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x5564ec39f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x5564ec39ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f419571bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0x589518e (0x7f41cd6d518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f41cd6cf9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f41cd6cfce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f41cd6d0b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f41cd685f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f41cd685f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f41cd685f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f41cd685f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f41968c3c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f41968cac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f41968edb60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #12: + 0x5838439 (0x7f41cd678439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #13: + 0x5843330 (0x7f41cd683330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7f41cd6833c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #15: + 0x4e893cc (0x7f41cccc93cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7f41c9848a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7f41cd689a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #18: + 0x584ed35 (0x7f41cd68ed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #19: + 0xc97eee (0x7f41dff40eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7f41df6bcea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #21: + 0x1445a6 (0x55cd28c4b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55cd28c44a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x55cd28c57866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 
(0x55cd28c40142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55cd28c4ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55cd28c57f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55cd28c3e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55cd28c4ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55cd28c3c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55cd28c3c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55cd28c3c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55cd28c3c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55cd28c43f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55cd28c55c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x55cd28d18239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: 
_PyObject_MakeTpCall + 0x26b (0x55cd28c44a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55cd28c403e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55cd28c4ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55cd28c3bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55cd28c4ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55cd28c3c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55cd28c57f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55cd28c3e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x55cd28c57f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55cd28c3e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55cd28c4ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55cd28c44007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55cd28c55c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x55cd28d18239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x55cd28c58067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55cd28c3e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55cd28c3c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #60: PyObject_Call + 0xbc (0x55cd28c57f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55cd28c3e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x55cd28c57582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x55cd28c57f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: pipeline_state.run_communication() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ 
[default6]: return self._call_impl(*args, **kwargs) [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: recv_activation_tensor = recv_activation() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: dist.recv( [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: return func(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbf52433d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0x589518e (0x7fbf8a3ed18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return func(*args, **kwargs) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fbf8a3e79a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fbf8a3e7ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fbf8a3e8b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbf8a39df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbf8a39df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbf8a39df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4779671d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0x589518e (0x7f47b162b18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 
(0x7f47b16259a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbf8a39df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fbf535dbc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f47b1625ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f47b1626b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fbf535e2c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f47b15dbf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fbf53605b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f47b15dbf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f47b15dbf81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #12: + 0x5838439 (0x7fbf8a390439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7fbf8a39b330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f47b15dbf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7fbf8a39b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f477a819c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #15: + 0x4e893cc (0x7fbf899e13cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f477a820c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f477a843b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #12: + 0x5838439 (0x7f47b15ce439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #13: + 0x5843330 (0x7f47b15d9330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default6]:frame #14: + 0x58433c5 (0x7f47b15d93c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7f47b0c1f3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #16: + 0x1a08a88 (0x7fbf86560a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #17: + 0x5849a84 (0x7fbf8a3a1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #18: + 0x584ed35 (0x7fbf8a3a6d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7fbf9cc58eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #16: + 0x1a08a88 (0x7f47ad79ea88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #20: + 0x413ea4 (0x7fbf9c3d4ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #21: + 0x1445a6 (0x5613228f15a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5613228eaa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x5613228fd866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5613228e6142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5613228f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc 
(0x5613228fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #17: + 0x5849a84 (0x7f47b15dfa84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5613228e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5613228f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7f47b15e4d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5613228e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x5613228fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5613228e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7f47c3e96eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #20: + 0x413ea4 (0x7f47c3612ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #21: + 0x1445a6 (0x56406f7505a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x5613228fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x56406f749a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x56406f75c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56406f745142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x56406f750a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5613228e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x56406f75cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56406f7432b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x5613228fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x56406f750a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56406f7418fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5613228e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5613228e9f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56406f7418fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5613228fbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x5613229be239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5613228eaa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: 
_PyEval_EvalFrameDefault + 0x4eb6 (0x5613228e63e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5613228f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5613228e1c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5613228f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5613228e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56406f7418fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x5613228fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56406f7418fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x5613228fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5613228e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56406f748f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x56406f75ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x5613228fd582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x56406f81d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x56406f749a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x5613228fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5613228e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56406f7453e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5613228f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x56406f750a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5613228ea007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5613228fbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x5613229be239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x5613228fe067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56406f740c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x56406f750a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56406f7418fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame 
#56: _PyEval_EvalFrameDefault + 0x2d83 (0x5613228e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x5613228fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5613228e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x56406f75cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x5613228fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x5613228fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5613228e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56406f7432b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x5613228fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x5613228fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #49: PyObject_Call + 0xbc (0x56406f75cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56406f7432b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x56406f750a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56406f749007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x56406f75ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x56406f81d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x56406f75d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56406f7432b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56406f7418fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x56406f75cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56406f7432b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x56406f75c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x56406f75cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. [2024-07-06 09:20:01,473] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 52017 closing signal SIGTERM [2024-07-06 09:20:01,473] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 52018 closing signal SIGTERM [2024-07-06 09:20:01,476] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68097 closing signal SIGTERM [2024-07-06 09:20:01,477] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68098 closing signal SIGTERM [2024-07-06 09:20:01,477] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68099 closing signal SIGTERM [2024-07-06 09:20:01,477] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68100 closing signal SIGTERM [2024-07-06 09:20:01,479] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68101 closing signal SIGTERM [2024-07-06 09:20:01,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 449596 closing signal SIGTERM [2024-07-06 09:20:01,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 449597 closing signal SIGTERM [2024-07-06 09:20:01,480] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68102 closing signal SIGTERM [2024-07-06 09:20:01,480] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68103 closing signal SIGTERM [2024-07-06 09:20:01,482] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54536 closing signal SIGTERM [2024-07-06 09:20:01,482] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54537 closing signal SIGTERM [2024-07-06 09:20:02,511] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 52019) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call 
last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:20:01 host : ip-26-0-169-132.ec2.internal rank : 19 (local_rank: 3) exitcode : 1 (pid: 52020) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:20:01 host : ip-26-0-169-132.ec2.internal rank : 20 (local_rank: 4) exitcode : 1 (pid: 52021) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-06_09:20:01 host : ip-26-0-169-132.ec2.internal rank : 21 (local_rank: 5) exitcode : 1 (pid: 52022) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-06_09:20:01 host : ip-26-0-169-132.ec2.internal rank : 22 
(local_rank: 6) exitcode : 1 (pid: 52023) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-06_09:20:01 host : ip-26-0-169-132.ec2.internal rank : 23 (local_rank: 7) exitcode : 1 (pid: 52024) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:20:01 host : ip-26-0-169-132.ec2.internal rank : 18 (local_rank: 2) exitcode : 1 (pid: 52019) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ [2024-07-06 09:20:02,609] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 449598) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( 
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:20:01 host : ip-26-0-169-139.ec2.internal rank : 27 (local_rank: 3) exitcode : 1 (pid: 449599) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:20:01 host : ip-26-0-169-139.ec2.internal rank : 28 (local_rank: 4) exitcode : 1 (pid: 449600) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-06_09:20:01 host : ip-26-0-169-139.ec2.internal rank : 29 (local_rank: 5) exitcode : 1 (pid: 449601) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-06_09:20:01 host : ip-26-0-169-139.ec2.internal rank : 30 (local_rank: 6) exitcode : 1 (pid: 449602) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-06_09:20:01 host : ip-26-0-169-139.ec2.internal rank : 31 (local_rank: 7) exitcode : 1 (pid: 449603) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:20:01 host : ip-26-0-169-139.ec2.internal rank : 26 (local_rank: 2) exitcode : 1 (pid: 449598) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ [2024-07-06 09:20:02,707] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 54538) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call 
last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:20:01 host : ip-26-0-168-238.ec2.internal rank : 11 (local_rank: 3) exitcode : 1 (pid: 54539) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:20:01 host : ip-26-0-168-238.ec2.internal rank : 12 (local_rank: 4) exitcode : 1 (pid: 54540) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-06_09:20:01 host : ip-26-0-168-238.ec2.internal rank : 13 (local_rank: 5) exitcode : 1 (pid: 54541) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-06_09:20:01 host : ip-26-0-168-238.ec2.internal rank : 14 
(local_rank: 6) exitcode : 1 (pid: 54542) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-06_09:20:01 host : ip-26-0-168-238.ec2.internal rank : 15 (local_rank: 7) exitcode : 1 (pid: 54543) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:20:01 host : ip-26-0-168-238.ec2.internal rank : 10 (local_rank: 2) exitcode : 1 (pid: 54538) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-169-132: task 3: Exited with exit code 1 srun: error: ip-26-0-169-139: task 4: Exited with exit code 1 srun: error: ip-26-0-168-238: task 1: Exited with exit code 1 [2024-07-06 09:20:03,499] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 68096) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:20:01 host : ip-26-0-168-120.ec2.internal rank : 0 (local_rank: 0) exitcode : -6 (pid: 68096) error_file: traceback : Signal 6 (SIGABRT) received by PID 68096 ============================================================ srun: error: ip-26-0-168-120: task 0: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.