========================
START TIME: Wed Jul 3 22:37:45 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
fatal: Unable to create '/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/.git/index.lock': File exists.
Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue.
Job status: RUNNING
W0703 22:37:54.885000 139629033473856 torch/distributed/run.py:757]
W0703 22:37:54.885000 139629033473856 torch/distributed/run.py:757] *****************************************
W0703 22:37:54.885000 139629033473856 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 22:37:54.885000 139629033473856 torch/distributed/run.py:757] ***************************************** [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Config: [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: run='%date_%jobid', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: seed=42, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: step=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: consumed_train_samples=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: benchmark_csv_path=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: ignore_sanity_checks=True), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: parallelism=ParallelismArgs(dp=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: pp=8, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tp=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: pp_engine=, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tp_mode=, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tp_linear_async_communication=False, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: expert_parallel_size=1), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: eos_token_id=2, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: hidden_act='silu', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: hidden_size=2048, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: initializer_range=0.02, [default0]:07/03/2024 22:38:17 
[INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: intermediate_size=4096, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: is_llama_config=True, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: max_position_embeddings=4096, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: num_attention_heads=32, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: num_hidden_layers=24, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: num_key_value_heads=32, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: pad_token_id=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: pretraining_tp=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: rms_norm_eps=1e-05, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: rope_scaling=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: rope_theta=10000.0, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tie_word_embeddings=True, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: use_cache=True, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: vocab_size=50257), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: init_method=RandomInit(std=0.025), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: dtype=torch.bfloat16, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: make_vocab_size_divisible_by=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: ddp_bucket_cap_mb=25), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tokenizer_revision=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tokenizer_max_length=None), 
[default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: checkpoint_interval=100000, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: save_initial_state=False, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: resume_checkpoint_path=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: checkpoints_path_is_shared_file_system=False), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: logging=LoggingArgs(log_level='info', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: log_level_replica='info', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: iteration_step_info_interval=1), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tokens=TokensArgs(sequence_length=4096, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: train_steps=20, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: micro_batch_size=64, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: batch_accumulation_per_replica=16, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: val_check_interval=-1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: limit_val_batches=0, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: limit_test_batches=0), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: adam_beta1=0.9, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: adam_beta2=0.95, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: torch_adam_is_fused=True, [default0]:07/03/2024 22:38:17 
[INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: name='adamW'), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: zero_stage=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: weight_decay=0.01, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: clip_grad=1.0, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: accumulate_grad_in_fp32=True, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: lr_warmup_steps=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: lr_warmup_style='linear', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: lr_decay_style='linear', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: lr_decay_steps=19, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: lr_decay_starting_step=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: min_decay_lr=1e-05)), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: start_training_step=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: hf_dataset_splits='train', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: hf_dataset_config_name=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: dataset_processing_num_proc_per_process=64, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: dataset_overwrite_cache=False, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: text_column_name='text'), 
[default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: seed=42, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: num_loading_workers=0))], [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/8_GPUS/dp-1_tp-1_pp-8_mbz-64')), [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: lighteval=None) [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Model Config: [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: LlamaConfig(bos_token_id=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: eos_token_id=2, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: hidden_act='silu', [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: hidden_size=2048, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: initializer_range=0.02, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: intermediate_size=4096, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: is_llama_config=True, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: max_position_embeddings=4096, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: num_attention_heads=32, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: num_hidden_layers=24, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: num_key_value_heads=32, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: pad_token_id=None, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: pretraining_tp=1, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: rms_norm_eps=1e-05, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: rope_scaling=None, [default0]:07/03/2024 22:38:17 
[INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: rope_theta=10000.0, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: tie_word_embeddings=True, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: use_cache=True, [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: vocab_size=50257) [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Building model.. [default0]:07/03/2024 22:38:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Setting PP block ranks... [default1]:07/03/2024 22:38:34 [INFO|DP=0|PP=1|TP=0|ip-26-0-164-207]: Local number of parameters: 126M (240.02MiB) [default1]:07/03/2024 22:38:34 [INFO|DP=0|PP=1|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 243.03MiB. Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default1]:07/03/2024 22:38:34 [INFO|DP=0|PP=1|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default5]:07/03/2024 22:38:34 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: Local number of parameters: 126M (240.02MiB) [default5]:07/03/2024 22:38:34 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 243.03MiB. Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default5]:07/03/2024 22:38:34 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default4]:07/03/2024 22:38:34 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: Local number of parameters: 126M (240.02MiB) [default4]:07/03/2024 22:38:34 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 243.03MiB. Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default4]:07/03/2024 22:38:34 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default2]:07/03/2024 22:38:34 [INFO|DP=0|PP=2|TP=0|ip-26-0-164-207]: Local number of parameters: 126M (240.02MiB) [default2]:07/03/2024 22:38:34 [INFO|DP=0|PP=2|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 243.03MiB. 
Peak allocated: 245.06MiB Peak reserved: 262.00MiB [default2]:07/03/2024 22:38:34 [INFO|DP=0|PP=2|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Total number of parameters: 1.21G (2312.82MiB) [default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Local number of parameters: 271M (516.35MiB) [default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 520.36MiB. Peak allocated: 522.39MiB Peak reserved: 534.00MiB [default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Parametrizing model parameters using StandardParametrizator [default3]:07/03/2024 22:38:34 [INFO|DP=0|PP=3|TP=0|ip-26-0-164-207]: Local number of parameters: 168M (320.03MiB) [default3]:07/03/2024 22:38:34 [INFO|DP=0|PP=3|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 324.04MiB. Peak allocated: 326.07MiB Peak reserved: 336.00MiB [default6]:07/03/2024 22:38:34 [INFO|DP=0|PP=6|TP=0|ip-26-0-164-207]: Local number of parameters: 168M (320.03MiB) [default6]:07/03/2024 22:38:34 [INFO|DP=0|PP=6|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 324.04MiB. Peak allocated: 326.07MiB Peak reserved: 336.00MiB [default6]:07/03/2024 22:38:34 [INFO|DP=0|PP=6|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default7]:07/03/2024 22:38:34 [INFO|DP=0|PP=7|TP=0|ip-26-0-164-207]: Local number of parameters: 103M (196.32MiB) [default7]:07/03/2024 22:38:34 [INFO|DP=0|PP=7|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 196.33MiB. Peak allocated: 196.33MiB Peak reserved: 200.00MiB [default7]:07/03/2024 22:38:34 [INFO|DP=0|PP=7|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default3]:07/03/2024 22:38:34 [INFO|DP=0|PP=3|TP=0|ip-26-0-164-207]: No checkpoint path provided. 
[default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/03/2024 22:38:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [ZeRO sharding] DP Rank 0 has 271M out of 271M (100.00%) params' optimizer states [default0]:07/03/2024 22:38:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/03/2024 22:38:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Using `datasets` library [default0]:07/03/2024 22:38:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/03/2024 22:38:36 [WARNING|DP=0|PP=0|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 22:38:38 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [Training Plan] There are 1 training stages [default0]:07/03/2024 22:38:38 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [Stage Training Stage] start from step 1 [default0]:07/03/2024 22:38:38 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [default0]:07/03/2024 22:38:38 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: [Start training] datetime: 2024-07-03 22:38:38.470128 | mbs: 64 | grad_accum: 16 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/03/2024 22:38:38 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/03/2024 22:38:38 [INFO|DP=0|PP=0|TP=0|ip-26-0-164-207]: Memory usage: 2585.75MiB. Peak allocated 2585.75MiB. 
Peak reserved: 2602.00MiB [default5]:07/03/2024 22:38:38 [WARNING|DP=0|PP=5|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 22:38:38 [WARNING|DP=0|PP=1|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 22:38:38 [WARNING|DP=0|PP=4|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 22:38:38 [WARNING|DP=0|PP=2|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 22:38:38 [WARNING|DP=0|PP=3|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 22:38:38 [WARNING|DP=0|PP=7|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 22:38:38 [WARNING|DP=0|PP=6|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. 
[default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) 
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 389, in forward [default0]:[rank0]: .contiguous() [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB. GPU [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl 
[default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default7]:[rank7]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 
126, in forward [default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank7]: pipeline_state.run_communication() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank7]: recv_activation_tensor = recv_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank7]: dist.recv( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank7]: return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: [7] is setting up NCCL 
communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc5b24e5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank7]: frame #1: + 0x5b3a23e (0x7fc5ec00223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fc5ebffcc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc5ebffcf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc5ebffdfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc5ebfb2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc5ebfb2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc5ebfb2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 
(0x7fc5ebfb2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc5b37bf189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fc5b37c6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fc5b37e5978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #12: + 0x5adc309 (0x7fc5ebfa4309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #13: + 0x5ae6f10 (0x7fc5ebfaef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #14: + 0x5ae6fa5 (0x7fc5ebfaefa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #15: + 0x5124446 (0x7fc5eb5ec446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #16: + 0x1acf4b8 (0x7fc5e7f974b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #17: + 0x5aee004 (0x7fc5ebfb6004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #18: + 0x5af36b5 (0x7fc5ebfbb6b5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #19: + 0xd2631e (0x7fc5feba531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank7]: frame #20: + 0x47def4 (0x7fc5fe2fcef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank7]: frame #21: + 0x1445a6 (0x55b0ee15f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b0ee158a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #23: + 0x150866 (0x55b0ee16b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b0ee154142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b0ee15fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x55b0ee16bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ee1522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b0ee15fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ee1508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #30: + 0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ee1508fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #32: + 0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ee1508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #34: + 0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ee1508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b0ee157f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b0ee169c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #38: + 0x211239 (0x55b0ee22c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b0ee158a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b0ee1543e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b0ee15fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b0ee14fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b0ee15fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ee1508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #45: + 
0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x55b0ee16bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ee1522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #48: + 0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x55b0ee16bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ee1522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b0ee15fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b0ee158007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b0ee169c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #54: + 0x211239 (0x55b0ee22c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #55: _PyObject_MakeTpCall + 0x26b (0x55b0ee158a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x55b0ee154c53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #57: + 0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ee1508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #59: + 
0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x55b0ee16bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ee1522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #62: + 0x150582 (0x55b0ee16b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x55b0ee16bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: output = model(**micro_batch) [default3]:[rank3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank3]: pipeline_state.run_communication() [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank3]: recv_activation_tensor = recv_activation() [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank3]: dist.recv( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank3]: return func(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank3]: 
pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank3]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default3]:[rank3]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank3]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2e5b904897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank3]: frame #1: + 0x5b3a23e (0x7f2e9542123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f2e9541bc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2e9541bf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2e9541cfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e953d1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e953d1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e953d1371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e953d1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2e5cbde189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank3]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f2e5cbe5610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank3]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f2e5cc04978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank3]: frame #12: + 0x5adc309 (0x7f2e953c3309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #13: + 0x5ae6f10 (0x7f2e953cdf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #14: + 0x5ae6fa5 (0x7f2e953cdfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #15: + 0x5124446 (0x7f2e94a0b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #16: + 0x1acf4b8 (0x7f2e913b64b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #17: + 0x5aee004 (0x7f2e953d5004 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #18: + 0x5af36b5 (0x7f2e953da6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank3]: frame #19: + 0xd2631e (0x7f2ea7fc431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank3]: frame #20: + 0x47def4 (0x7f2ea771bef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank3]: frame #21: + 0x1445a6 (0x55f0446f95a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55f0446f2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #23: + 0x150866 (0x55f044705866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f0446ee142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55f0446f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #26: PyObject_Call + 0xbc (0x55f044705f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f0446ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55f0446f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f0446ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #30: + 0x150582 (0x55f044705582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f0446ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #32: + 0x150582 (0x55f044705582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f0446ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #34: + 0x150582 (0x55f044705582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f0446ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f0446f1f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55f044703c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #38: + 0x211239 (0x55f0447c6239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55f0446f2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f0446ee3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55f0446f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f0446e9c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55f0446f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #44: 
_PyEval_EvalFrameDefault + 0x13ca (0x55f0446ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #45: + 0x150582 (0x55f044705582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #46: PyObject_Call + 0xbc (0x55f044705f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f0446ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #48: + 0x150582 (0x55f044705582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #49: PyObject_Call + 0xbc (0x55f044705f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f0446ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55f0446f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f0446f2007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55f044703c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #54: + 0x211239 (0x55f0447c6239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #55: PyObject_Call + 0x207 (0x55f044706067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f0446ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #57: + 0x150582 (0x55f044705582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #58: 
_PyEval_EvalFrameDefault + 0x13ca (0x55f0446ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #59: + 0x150582 (0x55f044705582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #60: PyObject_Call + 0xbc (0x55f044705f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f0446ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #62: + 0x150582 (0x55f044705582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: frame #63: PyObject_Call + 0xbc (0x55f044705f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank3]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl 
[default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: trainer.train(dataloader) [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: output = model(**micro_batch) [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 
1541, in _call_impl [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank6]: sharded_logits = self.model( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: pipeline_state.run_communication() [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: recv_activation_tensor = recv_activation() [default6]:[rank6]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank4]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank1]: Traceback (most recent call last): [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: dist.recv( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) 
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: return func(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank6]: pipeline_state.run_communication() [default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank5]: torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer 
[default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank6]: recv_activation_tensor = recv_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27c3283897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank4]: pipeline_state.run_communication() [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: recv_activation_tensor = recv_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", 
line 31, in __call__ [default1]:[rank1]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank5]: frame #1: + 0x5b3a23e (0x7f27fcda023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f27fcd9ac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank6]: dist.recv( [default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank1]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f27fcd9af82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank4]: buffers, futures = 
self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank6]: return func(*args, **kwargs) [default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f27fcd9bfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27fcd50371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27fcd50371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank6]: torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27fcd50371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27fcd50371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f56c190b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: dist.recv( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank6]: frame #1: + 0x5b3a23e (0x7f56fb42823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: sharded_logits = self.model( [default4]:[rank4]: return func(*args, **kwargs) [default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f27c455d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f27c4564610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 
0x5f8 (0x7f27c4583978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #12: + 0x5adc309 (0x7f27fcd42309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #13: + 0x5ae6f10 (0x7f27fcd4cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f56fb422c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f41502bd897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f56fb422f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #14: + 0x5ae6fa5 (0x7f27fcd4cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f56fb423fd1 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f56fb3d8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #1: + 0x5b3a23e (0x7f4189dda23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #15: + 0x5124446 (0x7f27fc38a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #16: + 0x1acf4b8 (0x7f27f8d354b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f4189dd4c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f56fb3d8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #17: + 0x5aee004 (0x7f27fcd54004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4189dd4f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: return forward_call(*args, **kwargs) [default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f56fb3d8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f56fb3d8371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #18: + 0x5af36b5 (0x7f27fcd596b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4189dd5fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4189d8a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f56c2be5189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #19: + 0xd2631e (0x7f280f94331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank5]: frame #20: + 0x47def4 (0x7f280f09aef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4189d8a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f56c2bec610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) 
[default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f56c2c0b978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: frame #21: + 0x1445a6 (0x55a05da6c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a05da65a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4189d8a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #12: + 0x5adc309 (0x7f56fb3ca309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #23: + 0x150866 (0x55a05da78866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4189d8a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #13: + 0x5ae6f10 (0x7f56fb3d4f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a05da61142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a05da6ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x55a05da78f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4151597189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a05da5f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #14: + 0x5ae6fa5 (0x7f56fb3d4fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #15: + 0x5124446 (0x7f56faa12446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a05da6ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a05da5d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #16: + 0x1acf4b8 (0x7f56f73bd4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #30: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: frame #17: + 0x5aee004 (0x7f56fb3dc004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #18: + 0x5af36b5 (0x7f56fb3e16b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a05da5d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f415159e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f41515bd978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #32: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default4]:[rank4]: frame #12: + 0x5adc309 (0x7f4189d7c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #13: + 0x5ae6f10 (0x7f4189d86f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a05da5d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #19: + 0xd2631e (0x7f570dfcb31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank6]: frame #20: + 0x47def4 (0x7f570d722ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank1]: 
new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank5]: frame #34: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #14: + 0x5ae6fa5 (0x7f4189d86fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a05da5d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a05da64f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #15: + 0x5124446 (0x7f41893c4446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #16: + 0x1acf4b8 (0x7f4185d6f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a05da76c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #21: + 0x1445a6 (0x55b7f1a145a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #17: + 0x5aee004 (0x7f4189d8e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #18: + 0x5af36b5 (0x7f4189d936b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank1]: pipeline_state.run_communication() [default4]:[rank4]: frame #19: + 0xd2631e (0x7f419c97d31e in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b7f1a0da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #23: + 0x150866 (0x55b7f1a20866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank1]: recv_activation_tensor = recv_activation() [default4]:[rank4]: frame #20: + 0x47def4 (0x7f419c0d4ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank4]: frame #21: + 0x1445a6 (0x55ba722fb5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #38: + 0x211239 (0x55a05db39239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a05da65a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a05da613e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55ba722f4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, 
tag=tag) [default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a05da6ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a05da5cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #23: + 0x150866 (0x55ba72307866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55ba722f0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b7f1a09142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55ba722fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x55ba72307f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a05da6ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a05da5d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b7f1a14a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x55b7f1a20f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #45: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame 
#46: PyObject_Call + 0xbc (0x55a05da78f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55ba722ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7f1a072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b7f1a14a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a05da5f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: dist.recv( [default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55ba722fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55ba722ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #30: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b7f1a058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #48: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55ba722ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank1]: return func(*args, **kwargs) [default1]:[rank1]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: frame #32: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55ba722ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x55a05da78f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #30: + 0x150582 (0x55b7f1a20582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b7f1a058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a05da5f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a05da6ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #34: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a05da65007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55ba722ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #32: + 0x150582 (0x55b7f1a20582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b7f1a058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: torch.distributed.DistBackendError: [1] is setting up 
NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank1]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55ba722f3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55ba72305c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #38: + 0x211239 (0x55ba723c8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff15f476897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank1]: frame #1: + 0x5b3a23e (0x7ff198f9323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a05da76c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #54: + 0x211239 (0x55a05db39239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55ba722f4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x55a05da79067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #34: + 0x150582 (0x55b7f1a20582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b7f1a058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 
(0x55ba722f03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55ba722fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7ff198f8dc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff198f8df82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55ba722ebc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55ba722fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b7f1a0cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a05da5f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #57: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55ba722ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b7f1a1ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #38: + 0x211239 (0x55b7f1ae1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #45: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #4: 
c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff198f8efd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff198f43371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x55ba72307f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55ba722ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a05da5d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff198f43371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff198f43371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff198f43371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b7f1a0da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b7f1a093e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #48: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b7f1a14a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff160750189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank1]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7ff160757610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b7f1a04c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b7f1a14a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #59: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x55a05da78f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a05da5f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x55ba72307f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55ba722ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #62: + 0x150582 (0x55a05da78582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x55a05da78f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:[rank1]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7ff160776978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank1]: frame #12: + 0x5adc309 (0x7ff198f35309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #13: + 0x5ae6f10 (0x7ff198f3ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55ba722fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55ba722f4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b7f1a058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #45: + 0x150582 (0x55b7f1a20582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x55b7f1a20f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55ba72305c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7f1a072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #48: + 0x150582 (0x55b7f1a20582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #14: + 0x5ae6fa5 (0x7ff198f3ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #15: + 0x5124446 (0x7ff19857d446 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #54: + 0x211239 (0x55ba723c8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x55ba72308067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #16: + 0x1acf4b8 (0x7ff194f284b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #17: + 0x5aee004 (0x7ff198f47004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x55b7f1a20f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7f1a072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #18: + 0x5af36b5 (0x7ff198f4c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank1]: frame #19: + 0xd2631e (0x7ff1abb3631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55ba722ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #57: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b7f1a14a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b7f1a0d007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca 
(0x55ba722ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #20: + 0x47def4 (0x7ff1ab28def4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank1]: frame #21: + 0x1445a6 (0x561e212645a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #59: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x55ba72307f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b7f1a1ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #54: + 0x211239 (0x55b7f1ae1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x55b7f1a21067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561e2125da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55ba722ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7f1a072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #23: + 0x150866 (0x561e21270866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #62: + 0x150582 (0x55ba72307582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x55ba72307f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #57: + 0x150582 (0x55b7f1a20582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank1]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561e21259142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561e21264a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #26: PyObject_Call + 0xbc (0x561e21270f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561e212572b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561e21264a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561e212558fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b7f1a058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #59: + 0x150582 (0x55b7f1a20582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #30: + 0x150582 (0x561e21270582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561e212558fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #32: + 0x150582 (0x561e21270582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561e212558fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #34: + 0x150582 (0x561e21270582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561e212558fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561e2125cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x55b7f1a20f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7f1a072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #62: + 0x150582 (0x55b7f1a20582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561e2126ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x55b7f1a20f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #38: + 0x211239 (0x561e21331239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:[rank1]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561e2125da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561e212593e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561e21264a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561e21254c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561e21264a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561e212558fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #45: + 0x150582 (0x561e21270582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #46: PyObject_Call + 0xbc (0x561e21270f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561e212572b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #48: + 0x150582 (0x561e21270582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #49: PyObject_Call + 0xbc (0x561e21270f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561e212572b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561e21264a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561e2125d007 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561e2126ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #54: + 0x211239 (0x561e21331239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #55: PyObject_Call + 0x207 (0x561e21271067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561e212572b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #57: + 0x150582 (0x561e21270582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561e212558fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #59: + 0x150582 (0x561e21270582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #60: PyObject_Call + 0xbc (0x561e21270f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561e212572b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #62: + 0x150582 (0x561e21270582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: frame #63: PyObject_Call + 0xbc (0x561e21270f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank1]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default2]:[rank2]: Traceback (most recent call last):
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default2]:[rank2]:     trainer.train(dataloader)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default2]:[rank2]:     outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default2]:[rank2]:     outputs = self.pipeline_engine.train_batch_iter(
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default2]:[rank2]:     output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default2]:[rank2]:     output = model(**micro_batch)
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]:     return self._call_impl(*args, **kwargs)
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]:     return forward_call(*args, **kwargs)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default2]:[rank2]:     sharded_logits = self.model(
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]:     return self._call_impl(*args, **kwargs)
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]:     return forward_call(*args, **kwargs)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default2]:[rank2]:     return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default2]:[rank2]:     hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]:     return self._call_impl(*args, **kwargs)
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]:     return forward_call(*args, **kwargs)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default2]:[rank2]:     new_kwargs[name] = recv_from_pipeline_state_buffer(
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default2]:[rank2]:     pipeline_state.run_communication()
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default2]:[rank2]:     recv_activation_tensor = recv_activation()
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default2]:[rank2]:     return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default2]:[rank2]:     buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default2]:[rank2]:     meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default2]:[rank2]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default2]:[rank2]:     dist.recv(
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default2]:[rank2]:     return func(*args, **kwargs)
[default2]:[rank2]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default2]:[rank2]:     pg.recv([tensor], group_src_rank, tag).wait()
[default2]:[rank2]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer
[default2]:[rank2]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default2]:[rank2]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3022186897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default2]:[rank2]: frame #1: + 0x5b3a23e (0x7f305bca323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:[rank2]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, 
std::chrono::duration >) + 0x2c7 (0x7f305bc9dc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f305bc9df82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f305bc9efd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f305bc53371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f305bc53371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f305bc53371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f305bc53371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f3023460189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank2]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f3023467610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank2]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, 
int) + 0x5f8 (0x7f3023486978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank2]: frame #12: + 0x5adc309 (0x7f305bc45309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #13: + 0x5ae6f10 (0x7f305bc4ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #14: + 0x5ae6fa5 (0x7f305bc4ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #15: + 0x5124446 (0x7f305b28d446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #16: + 0x1acf4b8 (0x7f3057c384b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #17: + 0x5aee004 (0x7f305bc57004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #18: + 0x5af36b5 (0x7f305bc5c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank2]: frame #19: + 0xd2631e (0x7f306e84631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank2]: frame #20: + 0x47def4 (0x7f306df9def4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank2]: frame #21: + 0x1445a6 (0x55f5c93ed5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55f5c93e6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #23: + 0x150866 
(0x55f5c93f9866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f5c93e2142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55f5c93eda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #26: PyObject_Call + 0xbc (0x55f5c93f9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f5c93e02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55f5c93eda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f5c93de8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #30: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f5c93de8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #32: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f5c93de8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #34: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f5c93de8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f5c93e5f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: 
frame #37: _PyObject_Call_Prepend + 0x69 (0x55f5c93f7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #38: + 0x211239 (0x55f5c94ba239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55f5c93e6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f5c93e23e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55f5c93eda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f5c93ddc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55f5c93eda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f5c93de8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #45: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #46: PyObject_Call + 0xbc (0x55f5c93f9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f5c93e02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #48: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #49: PyObject_Call + 0xbc (0x55f5c93f9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f5c93e02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default2]:[rank2]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55f5c93eda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f5c93e6007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55f5c93f7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #54: + 0x211239 (0x55f5c94ba239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #55: PyObject_Call + 0x207 (0x55f5c93fa067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f5c93e02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #57: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f5c93de8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #59: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #60: PyObject_Call + 0xbc (0x55f5c93f9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f5c93e02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #62: + 0x150582 (0x55f5c93f9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: frame #63: PyObject_Call + 0xbc (0x55f5c93f9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank2]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
W0703 22:38:45.061000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 604171 closing signal SIGTERM W0703 22:38:45.061000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 604172 closing signal SIGTERM W0703 22:38:45.062000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 604173 closing signal SIGTERM W0703 22:38:45.062000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 604174 closing signal SIGTERM W0703 22:38:45.063000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 604175 closing signal SIGTERM W0703 22:38:45.063000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 604176 closing signal SIGTERM W0703 22:38:45.064000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 604177 closing signal SIGTERM E0703 22:38:46.979000 139629033473856 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 604170) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, 
self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_22:38:45 host : ip-26-0-164-207.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 604170) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-164-207: task 0: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.