========================
START TIME: Tue Jul 2 19:52:07 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0702 19:52:10.271000 139929343477568 torch/distributed/run.py:757]
W0702 19:52:10.271000 139929343477568 torch/distributed/run.py:757] *****************************************
W0702 19:52:10.271000 139929343477568 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0702 19:52:10.271000 139929343477568 torch/distributed/run.py:757] *****************************************
W0702 19:52:10.275000 140024377644864 torch/distributed/run.py:757]
W0702 19:52:10.275000 140024377644864 torch/distributed/run.py:757] *****************************************
W0702 19:52:10.275000 140024377644864 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0702 19:52:10.275000 140024377644864 torch/distributed/run.py:757] ***************************************** [default0]:07/02/2024 19:52:27 [WARNING|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Vocab Size Padding] Padded vocab (size: 50257) with 3 dummy tokens (new size: 50260) [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Config: [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: run='%date_%jobid', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: seed=42, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: step=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: consumed_train_samples=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: benchmark_csv_path=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: ignore_sanity_checks=True), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: parallelism=ParallelismArgs(dp=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pp=4, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tp=4, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pp_engine=, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tp_mode=, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tp_linear_async_communication=False, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: expert_parallel_size=1), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: eos_token_id=2, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_act='silu', [default0]:07/02/2024 19:52:27 
[INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_size=2048, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: initializer_range=0.02, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: intermediate_size=4096, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: is_llama_config=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: max_position_embeddings=4096, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_attention_heads=32, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_hidden_layers=24, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_key_value_heads=32, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pad_token_id=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pretraining_tp=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rms_norm_eps=1e-05, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_scaling=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_theta=10000.0, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tie_word_embeddings=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: use_cache=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: vocab_size=50260), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: init_method=RandomInit(std=0.025), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: dtype=torch.bfloat16, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: make_vocab_size_divisible_by=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: ddp_bucket_cap_mb=25), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/02/2024 19:52:27 
[INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokenizer_revision=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokenizer_max_length=None), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: checkpoint_interval=100000, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: save_initial_state=False, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: resume_checkpoint_path=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: checkpoints_path_is_shared_file_system=False), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: logging=LoggingArgs(log_level='info', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: log_level_replica='info', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: iteration_step_info_interval=1), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokens=TokensArgs(sequence_length=4096, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: train_steps=20, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: micro_batch_size=128, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: batch_accumulation_per_replica=8, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: val_check_interval=-1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: limit_val_batches=0, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: limit_test_batches=0), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: adam_beta1=0.9, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: 
adam_beta2=0.95, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: torch_adam_is_fused=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: name='adamW'), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: zero_stage=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: weight_decay=0.01, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: clip_grad=1.0, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: accumulate_grad_in_fp32=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_warmup_steps=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_warmup_style='linear', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_decay_style='linear', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_decay_steps=19, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_decay_starting_step=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: min_decay_lr=1e-05)), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: start_training_step=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hf_dataset_splits='train', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hf_dataset_config_name=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: dataset_processing_num_proc_per_process=64, [default0]:07/02/2024 19:52:27 
[INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: dataset_overwrite_cache=False, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: text_column_name='text'), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: seed=42, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_loading_workers=32))], [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/16_GPUS/dp-1_tp-4_pp-4_mbz-128')), [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lighteval=None) [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Model Config: [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: LlamaConfig(bos_token_id=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: eos_token_id=2, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_act='silu', [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_size=2048, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: initializer_range=0.02, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: intermediate_size=4096, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: is_llama_config=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: max_position_embeddings=4096, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_attention_heads=32, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_hidden_layers=24, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_key_value_heads=32, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pad_token_id=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pretraining_tp=1, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: 
rms_norm_eps=1e-05, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_scaling=None, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_theta=10000.0, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tie_word_embeddings=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: use_cache=True, [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: vocab_size=50260) [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Building model.. [default0]:07/02/2024 19:52:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Setting PP block ranks... [default5]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=1|ip-26-0-169-139]: Local number of parameters: 73.4M (140.05MiB) [default5]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=1|ip-26-0-169-139]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default5]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=1|ip-26-0-169-139]: No checkpoint path provided. [default7]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=3|ip-26-0-169-139]: Local number of parameters: 73.4M (140.05MiB) [default7]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=3|ip-26-0-169-139]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default7]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=3|ip-26-0-169-139]: No checkpoint path provided. [default6]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=2|ip-26-0-169-139]: Local number of parameters: 73.4M (140.05MiB) [default6]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=2|ip-26-0-169-139]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default6]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=2|ip-26-0-169-139]: No checkpoint path provided. 
[default3]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=3|ip-26-0-169-139]: Local number of parameters: 99.2M (189.14MiB) [default3]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=3|ip-26-0-169-139]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default3]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=3|ip-26-0-169-139]: No checkpoint path provided. [default4]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-139]: Local number of parameters: 73.4M (140.05MiB) [default4]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default4]:07/02/2024 19:52:42 [INFO|DP=0|PP=1|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Total number of parameters: 1.21G (2313.42MiB) [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Local number of parameters: 99.2M (189.14MiB) [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Parametrizing model parameters using StandardParametrizator [default1]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=1|ip-26-0-169-139]: Local number of parameters: 99.2M (189.14MiB) [default1]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=1|ip-26-0-169-139]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default1]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=1|ip-26-0-169-139]: No checkpoint path provided. 
[default2]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=2|ip-26-0-169-139]: Local number of parameters: 99.2M (189.14MiB) [default2]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=2|ip-26-0-169-139]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default2]:07/02/2024 19:52:42 [INFO|DP=0|PP=0|TP=2|ip-26-0-169-139]: No checkpoint path provided. [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=0|ip-26-0-172-73]: Local number of parameters: 62.9M (120.05MiB) [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=0|ip-26-0-172-73]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default0]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=0|ip-26-0-172-73]: No checkpoint path provided. [default2]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=2|ip-26-0-172-73]: Local number of parameters: 62.9M (120.05MiB) [default2]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=2|ip-26-0-172-73]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default6]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=2|ip-26-0-172-73]: Local number of parameters: 67.7M (129.12MiB) [default6]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=2|ip-26-0-172-73]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default2]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=2|ip-26-0-172-73]: No checkpoint path provided. [default6]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=2|ip-26-0-172-73]: No checkpoint path provided. [default3]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=3|ip-26-0-172-73]: Local number of parameters: 62.9M (120.05MiB) [default3]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=3|ip-26-0-172-73]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default3]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=3|ip-26-0-172-73]: No checkpoint path provided. 
[default7]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=3|ip-26-0-172-73]: Local number of parameters: 67.7M (129.12MiB) [default7]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=3|ip-26-0-172-73]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default7]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=3|ip-26-0-172-73]: No checkpoint path provided. [default1]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=1|ip-26-0-172-73]: Local number of parameters: 62.9M (120.05MiB) [default1]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=1|ip-26-0-172-73]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default1]:07/02/2024 19:52:42 [INFO|DP=0|PP=2|TP=1|ip-26-0-172-73]: No checkpoint path provided. [default5]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=1|ip-26-0-172-73]: Local number of parameters: 67.7M (129.12MiB) [default5]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=1|ip-26-0-172-73]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default4]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=0|ip-26-0-172-73]: Local number of parameters: 67.7M (129.12MiB) [default4]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=0|ip-26-0-172-73]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default5]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=1|ip-26-0-172-73]: No checkpoint path provided. [default4]:07/02/2024 19:52:42 [INFO|DP=0|PP=3|TP=0|ip-26-0-172-73]: No checkpoint path provided. 
[default0]:07/02/2024 19:52:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/02/2024 19:52:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/02/2024 19:52:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [ZeRO sharding] DP Rank 0 has 99.2M out of 99.2M (100.00%) params' optimizer states
[default0]:07/02/2024 19:52:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/02/2024 19:52:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Using `datasets` library
[default0]:07/02/2024 19:52:44 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:07/02/2024 19:52:45 [WARNING|DP=0|PP=0|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/02/2024 19:52:45 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Training Plan] There are 1 training stages
[default0]:07/02/2024 19:52:45 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Stage Training Stage] start from step 1
[default0]:07/02/2024 19:52:45 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]:
[default0]:07/02/2024 19:52:45 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Start training] datetime: 2024-07-02 19:52:45.851132 | mbs: 128 | grad_accum: 8 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
[default0]:07/02/2024 19:52:45 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/02/2024 19:52:45 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Memory usage: 953.61MiB. Peak allocated 953.61MiB. Peak reserved: 960.00MiB
[default5]:07/02/2024 19:52:46 [WARNING|DP=0|PP=1|TP=1|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/02/2024 19:52:45 [WARNING|DP=0|PP=1|TP=2|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/02/2024 19:52:46 [WARNING|DP=0|PP=1|TP=3|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/02/2024 19:52:46 [WARNING|DP=0|PP=1|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/02/2024 19:52:45 [WARNING|DP=0|PP=0|TP=1|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:Repo card metadata block was not found. Setting CardData to empty.
[default4]:Repo card metadata block was not found. Setting CardData to empty.
[default1]:Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/02/2024 19:52:46 [WARNING|DP=0|PP=2|TP=0|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:Repo card metadata block was not found. Setting CardData to empty.
[default4]:Repo card metadata block was not found. Setting CardData to empty.
[default6]:Repo card metadata block was not found. Setting CardData to empty.
[default5]:Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/02/2024 19:52:46 [WARNING|DP=0|PP=2|TP=1|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/02/2024 19:52:45 [WARNING|DP=0|PP=2|TP=3|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/02/2024 19:52:46 [WARNING|DP=0|PP=2|TP=2|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/02/2024 19:52:46 [WARNING|DP=0|PP=3|TP=1|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/02/2024 19:52:46 [WARNING|DP=0|PP=3|TP=0|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:Repo card metadata block was not found. Setting CardData to empty.
[default1]:Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/02/2024 19:52:46 [WARNING|DP=0|PP=3|TP=3|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/02/2024 19:52:46 [WARNING|DP=0|PP=3|TP=2|ip-26-0-172-73]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:Repo card metadata block was not found. Setting CardData to empty.
[default6]:Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/02/2024 19:52:46 [WARNING|DP=0|PP=0|TP=2|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/02/2024 19:52:46 [WARNING|DP=0|PP=0|TP=3|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:Repo card metadata block was not found. Setting CardData to empty.
[default2]:[rank2]: Traceback (most recent call last):
[default3]:[rank3]: Traceback (most recent call last):
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default3]:[rank3]: trainer.train(dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter(
[default2]:[rank2]: trainer.train(dataloader)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default3]:[rank3]: output = model(**micro_batch)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter(
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: output = model(**micro_batch) [default3]:[rank3]: sharded_logits = self.model( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in 
forward [default2]:[rank2]: sharded_logits = self.model( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", 
line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 563, in forward [default2]:[rank2]: key_value_states = torch.cat([key_states.unsqueeze(0), value_states.unsqueeze(0)], dim=0) [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.83 GiB memory in use. Of the allocated memory 69.56 GiB is allocated by PyTorch, and 929.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank3]: output = self.pp_block(**new_kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default3]:[rank3]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return 
self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 563, in forward [default3]:[rank3]: key_value_states = torch.cat([key_states.unsqueeze(0), value_states.unsqueeze(0)], dim=0) [default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 877.94 MiB is free. Including non-PyTorch memory, this process has 78.46 GiB memory in use. Of the allocated memory 69.56 GiB is allocated by PyTorch, and 929.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) 
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) 
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 563, in forward [default0]:[rank0]: key_value_states = torch.cat([key_states.unsqueeze(0), value_states.unsqueeze(0)], dim=0) [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. 
GPU [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 563, in forward [default1]:[rank1]: key_value_states = torch.cat([key_states.unsqueeze(0), value_states.unsqueeze(0)], dim=0) [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 579.94 MiB is free. Including non-PyTorch memory, this process has 78.75 GiB memory in use. Of the allocated memory 69.56 GiB is allocated by PyTorch, and 929.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in 
_wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank7]: pipeline_state.run_communication() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel[default4]:[rank12]: Traceback (most recent call last): [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: 
trainer.train(dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank12]: output = model(**micro_batch) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: sharded_logits = self.model( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return 
forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank12]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank12]: pipeline_state.run_communication() [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank12]: recv_activation_tensor = recv_activation() [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank12]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank12]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank12]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank12]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank12]: dist.recv( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank12]: return func(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank12]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank12]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default4]:[rank12]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank12]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f90b00d3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank12]: frame #1: + 0x5b3a23e (0x7f90e9bf023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f90e9beac87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f90e9beaf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f90e9bebfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f90e9ba0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f90e9ba0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packa/state.py", line 150, in run_communication [default7]:[rank7]: recv_activation_tensor = recv_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) ges/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f90e9ba0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default4]:[rank12]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f90e9ba0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f90b13ad189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank12]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f90b13b4610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank12]: frame #11: c10d::ProcessGroupNCCL::[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank7]: dist.recv( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank7]: return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank7]: frame #recv(std::vector >&, int, int) + 0x5f8 (0x7f90b13d3978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank12]: frame #12: + 
0x5adc309 (0x7f90e9b92309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #13: + 0x5ae6f10 (0x7f90e9b9cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #14: + 0x5ae6fa5 (0x7f90e9b9cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #15: + 0x5124446 (0x7f90e91da446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #16: + 0x1ac0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1950785897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank7]: frame #1: + 0x5b3a23e (0x7f198a2a223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f198a29cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f198a29cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f198a29dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/tf4b8 (0x7f90e5b854b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #17: + 0x5aee004 (0x7f90e9ba4004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #18: + 0x5af36b5 
(0x7f90e9ba96b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #19: + 0xd2631e (0x7f90fc79331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank12]: frame #20: + 0x47def4 (0x7f90fbeeaef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank12]: frame #21: + 0x1445a6 (0x563b360ac5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluorch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f198a252371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f198a252371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f198a252371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f198a252371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1951a5f189 in /fsx/ferdinandmom/minister/bin/python3.10) [default4]:[rank12]: frame #22: _PyObject_MakeTpCall + 0x26b (0x563b360a5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #23: + 0x150866 (0x563b360b8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563b360a1142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #25: _PyFunction_Vectorcall + 0x6c (0x563b360aca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #26: PyObject_Call + 0xbc (0x563b360b8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563b3609f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #28: _PyFunction_Vectorcall + 0x6c (0x563b360aca2c in /fsx/ferdinandmom/miniforge3/eforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f1951a66610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f1951a85978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #12: + 0x5adc309 (0x7f198a244309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #13: + 0x5ae6f10 (0x7f198a24ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #14: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563b3609d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #32: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #33: 
_PyEval_EvalFrameDefault + 0x13ca (0x563b3609d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #34: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563b3609d8fown function> + 0x5ae6fa5 (0x7f198a24efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #15: + 0x5124446 (0x7f198988c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #16: + 0x1acf4b8 (0x7f19862374b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #17: + 0x5aee004 (0x7f198a256004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #18: + 0x5af36b5 (0x7f198a25b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #19: + 0xd2631e (0x7f199ce4531e in /fsx/ferdinandmom/miniforge3/envs/ea in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563b360a4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #37: _PyObject_Call_Prepend + 0x69 (0x563b360b6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #38: + 0x211239 (0x563b36179239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #39: _PyObject_MakeTpCall + 0x26b (0x563b360a5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563b360a13e6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #41: _PyFunction_Vectorcall + 0x6c (0x563b360aca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #42: _PyEval_EvalFramnv-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank7]: frame #20: + 0x47def4 (0x7f199c59cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank7]: frame #21: + 0x1445a6 (0x555cf03b75a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555cf03b0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #23: + 0x150866 (0x555cf03c3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555cf03ac142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555cf03b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[raeDefault + 0x72c (0x563b3609cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #43: _PyFunction_Vectorcall + 0x6c (0x563b360aca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563b3609d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #45: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #46: PyObject_Call + 0xbc (0x563b360b8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563b3609f2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #48: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame nk7]: frame #26: PyObject_Call + 0xbc (0x555cf03c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf03aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555cf03b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555cf03a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #30: + 0x150582 (0x555cf03c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555cf03a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #32: + 0x150582 (0x555cf03c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)#49: PyObject_Call + 0xbc (0x563b360b8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563b3609f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #51: _PyFunction_Vectorcall + 0x6c (0x563b360aca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563b360a5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #53: _PyObject_Call_Prepend + 0x69 (0x563b360b6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #54: + 0x211239 (0x563b36179239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #55: PyObject_Call + 0x207 (0x563b360b9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[ra [default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555cf03a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #34: + 0x150582 (0x555cf03c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555cf03a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555cf03aff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) nk12]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563b3609f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #57: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563b3609d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #59: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #60: PyObject_Call + 0xbc (0x563b360b8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563b3609f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #62: + 0x150582 (0x563b360b8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/pyth[default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555cf03c1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #38: + 0x211239 (0x555cf0484239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555cf03b0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555cf03ac3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555cf03b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555cf03a7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555cf03b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-clusteon3.10) [default4]:[rank12]: frame #63: PyObject_Call + 0xbc (0x563b360b8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallelr/bin/python3.10) [default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555cf03a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #45: + 0x150582 (0x555cf03c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x555cf03c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf03aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #48: + 0x150582 (0x555cf03c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x555cf03c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) /pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank10]: output = model(**micro_batch) [default2]:[rank10]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: sharded_logits [default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf03aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555cf03b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555cf03b0007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555cf03c1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #54: + 0x211239 (0x555cf0484239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x555cf03c4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf03aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-clust= self.model( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in 
_call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/eer/bin/python3.10) [default7]:[rank7]: frame #57: + 0x150582 (0x555cf03c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) nvs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank10]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank10]: pipeline_state.run_communication() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, [default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555cf03a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #59: + 0x150582 (0x555cf03c3582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x555cf03c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf03aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #62: + 0x150582 (0x555cf03c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x555cf03c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) in run_communication [default2]:[rank10]: recv_activation_tensor = recv_activation() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank10]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank10]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank10]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[r[default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
ank10]: dist.recv( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank10]: return func(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank10]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank10]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default2]:[rank10]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank10]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b66cbd897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-package[default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step s/torch/lib/libc10.so) [default2]:[rank10]: frame #1: + 0x5b3a23e (0x7f2ba07da23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f2ba07d4c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default2]:[rank10]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2ba07d4f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2ba07d5fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2ba078a371 in /fsx/ferdinandmom/miniforge3/e[default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) nvs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2ba078a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2ba078a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2ba078a371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2b67f97189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #10: c10d::ProcessGroupNCCL::getNCCLC[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) omm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f2b67f9e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f2b67fbd978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #12: + 0x5adc309 (0x7f2ba077c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #13: + 0x5ae6f10 (0x7f2ba0786f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #14: + 0x5ae6fa5 (0x7f2ba0786fa5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_c[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] pu.so) [default2]:[rank10]: frame #15: + 0x5124446 (0x7f2b9fdc4446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #16: + 0x1acf4b8 (0x7f2b9c76f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #17: + 0x5aee004 (0x7f2ba078e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #18: + 0x5af36b5 (0x7f2ba07936b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #19: + 0xd2631e (0x7f2bb337d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank10]: frame #20: + 0x47def4 (0x[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) 7f2bb2ad4ef4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank10]: frame #21: + 0x1445a6 (0x55923223c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #22: _PyObject_MakeTpCall + 0x26b (0x559232235a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #23: + 0x150866 (0x559232248866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559232231142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55923223ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #26: PyObject_Call + 0xbc (0x559232248f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward ]:[rank10]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55923222f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55923223ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55923222d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #30: + 0x150582 (0x559232248582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55923222d8fa 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #32: + 0x150582 (0x559232248582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55923222d8fa in /fsx/ferdinandmom/miniforge3/envs/env-benc[default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank6]: pipeline_state.run_communication() h-cluster/bin/python3.10) [default2]:[rank10]: frame #34: + 0x150582 (0x559232248582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55923222d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559232234f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #37: _PyObject_Call_Prepend + 0x69 (0x559232246c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #38: + 0x211239 (0x559232309239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #39: _PyObject_MakeTpCall + 0x26b (0x559232235a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5592322313e6 in /fsx/ferdi[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank6]: recv_activation_tensor = recv_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ 
[default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] nandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55923223ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55923222cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55923223ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55923222d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #45: + 0x150582 (0x559232248582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #46: PyObject_Call + 0xbc (0x559232248f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55923[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors 222f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #48: + 0x150582 (0x559232248582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #49: PyObject_Call + 0xbc (0x559232248f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55923222f2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55923223ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559232235007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #53: _PyObject_Call_Prepend + 0x69 (0x559232246c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #54: + 0x211239 (0x559232309239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #55: PyObject_Call + 0x207 (0x559232249067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55923222f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #57: + 0x150582 (0x559232248582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55923222d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #59: + 0x150582 (0x559232248582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #60: PyObject_Call + 0xbc (0x559232248f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #61: _PyEval[default6]:[rank6]: return func(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() _EvalFrameDefault + 0x2d83 (0x55923222f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #62: + 0x150582 (0x559232248582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #63: PyObject_Call + 0xbc (0x559232248f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench[default6]:[rank6]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): _cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, 
**kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_i[default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fecb7104897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank6]: frame #1: + 0x5b3a23e (0x7fecf0c2123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fecf0c1bc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) mpl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_clus[default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fecf0c1bf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame 
#4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fecf0c1cfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fecf0bd1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ter/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank9]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pip[default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fecf0bd1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fecf0bd1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fecf0bd1371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fecb83de189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) eline_state_buffer [default1]:[rank9]: pipeline_state.run_communication() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank9]: recv_activation_tensor = recv_activation() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank9]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank9]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank9]: meta = s[default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fecb83e5610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fecb8404978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #12: + 0x5adc309 (0x7fecf0bc3309 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) elf._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank9]: dist.recv( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank9]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank9]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default1]:[rank9]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most re[default6]:[rank6]: frame #13: + 0x5ae6f10 (0x7fecf0bcdf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) cent call first): [default1]:[rank9]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9e4e16b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank9]: frame #1: + 0x5b3a23e (0x7f9e87c8823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: trainer.train(dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", 
line 429, in train [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank1[default6]:[rank6]: frame #14: + 0x5ae6fa5 (0x7fecf0bcdfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #15: + 0x5124446 (0x7fecf020b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #16: + 0x1acf4b8 (0x7fececbb64b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 1]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: output = model(**micro_batch) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[ra[default6]:[rank6]: frame #17: + 0x5aee004 (0x7fecf0bd5004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #18: + 0x5af36b5 
(0x7fecf0bda6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #19: + 0xd2631e (0x7fed037c431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) nk11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank11]: sharded_logits = self.model( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_[default6]:[rank6]: frame #20: + 0x47def4 (0x7fed02f1bef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank6]: frame #21: + 0x1445a6 (0x55c4cd00b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c4cd004a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #23: + 0x150866 (0x55c4cd017866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) hidden_states [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank11]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank11]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank11]: pipeline_state.run[default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c4cd000142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c4cd00ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x55c4cd017f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c4ccffe2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) _communication() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank11]: recv_activation_tensor = recv_activation() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank11]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank11]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank11]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank11]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default[default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c4cd00ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c4ccffc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #30: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank11]: dist.recv( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank11]: return func(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank11]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank11]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default3]:[rank11]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank11]: frame [default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c4ccffc8fa 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #32: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c4ccffc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #34: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcd49578897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank11]: frame #1: + 0x5b3a23e (0x7fcd8309523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fcd8308fc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fcd8308ff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fcd83090fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packa[default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c4ccffc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c4cd003f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c4cd015c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #38: + 0x211239 (0x55c4cd0d8239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ges/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcd83045371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcd83045371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcd83045371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcd83045371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fcd4a852189 in /fsx/ferdina[default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c4cd004a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c4cd0003e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c4cd00ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c4ccffbc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ndmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fcd4a859610 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fcd4a878978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #12: + 0x5adc309 (0x7fcd83037309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #13: + 0x5ae6f10 (0x7fcd83041f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: f[default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c4cd00ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c4ccffc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #45: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x55c4cd017f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) rame #14: + 0x5ae6fa5 (0x7fcd83041fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #15: + 0x5124446 (0x7fcd8267f446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #16: + 0x1acf4b8 (0x7fcd7f02a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #17: + 0x5aee004 (0x7fcd83049004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #18: + 0x5af36b5 (0x7fcd8304e6b5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #19: + 0xd2631e (0x7fcd95c3831e in /fsx/ferdinandm[default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c4ccffe2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #48: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x55c4cd017f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c4ccffe2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) om/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #20: + 0x47def4 (0x7fcd9538fef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #21: + 0x1445a6 (0x56253be105a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56253be09a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #23: + 0x150866 (0x56253be1c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56253be05142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56253be10a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/[default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c4cd00ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c4cd004007 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c4cd015c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #54: + 0x211239 (0x55c4cd0d8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) python3.10) [default3]:[rank11]: frame #26: PyObject_Call + 0xbc (0x56253be1cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56253be032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56253be10a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56253be018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #30: + 0x150582 (0x56253be1c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56253be018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #32: + 0x150582 (0x56253be1c582 in /fsx/ferdinandmom/miniforge3/envs/[default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x55c4cd018067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c4ccffe2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #57: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c4ccffc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #33: _PyEval_EvalFrameDefault + 
0x13ca (0x56253be018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #34: + 0x150582 (0x56253be1c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56253be018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56253be08f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #59: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x55c4cd017f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c4ccffe2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #62: + 0x150582 (0x55c4cd017582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56253be1ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x55c4cd017f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]:[rank11]: frame #38: + 0x211239 (0x56253bedd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: Traceback (most recent call last): [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f9e87c82c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56253be09a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56253be053e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: frame #41: _PyFunction_Vectorcall + 0x6c 
(0x56253be10a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9e87c82f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56253be00c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56253be10a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default3]:[rank11]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56253be018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) 
[default1]:[rank9]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9e87c83fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank11]: frame #45: + 0x150582 (0x56253be1c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: frame #46: PyObject_Call + 0xbc (0x56253be1cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 
(0x7f9e87c38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank4]: pipeline_state.run_communication() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank4]: recv_activation_tensor = recv_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: File[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56253be032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank4]: dist.recv( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper 
[default4]:[rank4]: return func(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f65ef22b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank4]: frame #1: + 0x5b3a23e (0x7f6628d4823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9e87c38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9e87c38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9e87c38371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f6628d42c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame 
#3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6628d42f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6628d43fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6628cf8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9e4f445189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6628cf8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6628cf8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6628cf8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f9e4f44c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #48: + 0x150582 (0x56253be1c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 
44, in forward [default0]:[rank8]: output = model(**micro_batch) [default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f65f0505189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f65f050c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f65f052b978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f9e4f46b978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #49: PyObject_Call + 0xbc (0x56253be1cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #12: + 0x5adc309 (0x7f6628cea309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #13: + 0x5ae6f10 (0x7f6628cf4f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: frame #14: + 0x5ae6fa5 (0x7f6628cf4fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #15: + 0x5124446 (0x7f6628332446 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #16: + 0x1acf4b8 (0x7f6624cdd4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #17: + 0x5aee004 (0x7f6628cfc004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #18: + 0x5af36b5 (0x7f6628d016b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #19: + 0xd2631e (0x7f663b8eb31e in[default1]:[rank9]: frame #12: + 0x5adc309 (0x7f9e87c2a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56253be032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank4]: frame #20: + 0x47def4 (0x7f663b042ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: frame #21: + 0x1445a6 (0x558ca5b865a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x558ca5b7fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #23: + 0x150866 (0x558ca5b92866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x558ca5b7b142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x558ca5b86a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #13: + 0x5ae6f10 (0x7f9e87c34f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x558ca5b92f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x558ca5b792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x558ca5b86a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x558ca5b778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: frame #30: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x558ca5b778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #32: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x558ca5b778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #34: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x558ca5b778fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x558ca5b7ef50 in /fsx/ferdinandmom/miniforge3/envs/env-[default1]:[rank9]: frame #14: + 0x5ae6fa5 (0x7f9e87c34fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #15: + 0x5124446 (0x7f9e87272446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) bench-cluster/bin/python3.10) [default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x558ca5b90c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #38: + 0x211239 (0x558ca5c53239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x558ca5b7fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x558ca5b7b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x558ca5b86a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x558ca5b76c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x558ca5b86a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56253be10a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #44: 
_PyEval_EvalFrameDefault + 0x13ca (0x558ca5b778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #45: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x558ca5b92f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x558ca5b792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: frame #48: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x558ca5b92f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x558ca5b792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x558ca5b86a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #16: + 0x1acf4b8 (0x7f9e83c1d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x558ca5b7f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x558ca5b90c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #54: + 0x211239 (0x558ca5c53239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x558ca5b93067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x558ca5b792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #57: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x558ca5b778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56253be09007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #59: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x558ca5b92f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x558ca5b792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #62: + 0x150582 (0x558ca5b92582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: return forward_call(*args, **kwargs) [default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x558ca5b92f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:[rank9]: frame #17: + 0x5aee004 (0x7f9e87c3c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/na[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward notron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56253be1ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #54: + 0x211239 
(0x56253bedd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/[default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipelin[default3]:[rank11]: frame #55: PyObject_Call + 0x207 (0x56253be1d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #18: + 0x5af36b5 (0x7f9e87c416b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) e_state_buffer [default5]:[rank5]: pipeline_state.run_communication() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank5]: recv_activation_tensor = recv_activation() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank5]: dist.recv( [default5]:[rank5]: 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank5]: return func(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank5]: torch.d[default3]:[rank11]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56253be032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) istributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f72e48d5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank5]: frame #1: + 0x5b3a23e (0x7f731e3f223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f731e3ecc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f731e3ecf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 
(0x7f731e3edfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f731e3a2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f731e3a2371 in /fsx/ferdinandmom/miniforge3/envs/env-b[default1]:[rank9]: frame #19: + 0xd2631e (0x7f9e9a82b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) ench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f731e3a2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f731e3a2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f72e5baf189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f72e5bb6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector + 0x150582 (0x56253be1c582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) locator >&, int, int) + 0x5f8 (0x7f72e5bd5978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56253be018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #12: + 0x5adc309 (0x7f731e394309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #13: + 0x5ae6f10 (0x7f731e39ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: return forward_call(*args, **kwargs) [default5]:[rank5]: frame #14: + 0x5ae6fa5 (0x7f731e39efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #15: + 0x5124446 (0x7f731d9dc446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank5]: frame #16: + 0x1acf4b8 (0x7f731a3874b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #17: + 0x5aee004 (0x7f731e3a6004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #18: + 0x5af36b5 (0x7f731e3ab6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #19: + 0xd2631e (0x7f7330f9531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #59: + 0x150582 
(0x56253be1c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #20: + 0x47def4 (0x7f73306ecef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank5]: frame #21: + 0x1445a6 (0x55b1ad8935a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b1ad88ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #23: + 0x150866 (0x55b1ad89f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b1ad888142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #60: PyObject_Call + 0xbc (0x56253be1cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b1ad893a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x55b1ad89ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1ad8862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b1ad893a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #20: + 0x47def4 (0x7f9e99f82ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56253be032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #62: + 0x150582 (0x56253be1c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b1ad8848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #30: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b1ad8848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #32: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #21: + 0x1445a6 (0x55fe4426f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b1ad8848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #34: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b1ad8848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b1ad88bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b1ad89dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank11]: frame #63: PyObject_Call + 0xbc (0x56253be1cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #38: + 0x211239 (0x55b1ad960239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b1ad88ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b1ad8883e6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b1ad893a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b1ad883c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b1ad893a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b1ad8848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #45: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x55b1ad89ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1ad8862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #48: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x55b1ad89ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1ad8862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b1ad893a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #22: _PyObject_MakeTpCall + 
0x26b (0x55fe44268a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b1ad88c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b1ad89dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #54: + 0x211239 (0x55b1ad960239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x55b1ad8a0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: pipeline_state.run_communication() [default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1ad8862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #57: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b1ad8848fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #59: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #23: + 0x150866 (0x55fe4427b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x55b1ad89ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1ad8862b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #62: + 0x150582 (0x55b1ad89f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x55b1ad89ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: . 
This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank9]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55fe44264142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55fe4426fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: recv_activation_tensor = recv_activation() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank8]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank9]: frame #26: PyObject_Call + 0xbc (0x55fe4427bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55fe442622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55fe4426fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55fe442608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #30: + 0x150582 (0x55fe4427b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55fe442608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #32: + 0x150582 (0x55fe4427b582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55fe442608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #34: + 0x150582 (0x55fe4427b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55fe442608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55fe44267f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55fe44279c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #38: + 0x211239 (0x55fe4433c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55fe44268a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank8]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank9]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55fe442643e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55fe4426fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55fe4425fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55fe4426fa2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55fe442608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #45: + 0x150582 (0x55fe4427b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #46: PyObject_Call + 0xbc (0x55fe4427bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank8]: dist.recv( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55fe442622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #48: + 0x150582 (0x55fe4427b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: return func(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank9]: frame #49: PyObject_Call + 0xbc (0x55fe4427bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55fe442622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55fe4426fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank8]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key 
'1:2', but store->get('1:2') got error: Connection reset by peer [default0]:[rank8]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank9]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55fe44268007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8bc66ab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank8]: frame #1: + 0x5b3a23e (0x7f8c001c823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55fe44279c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f8c001c2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8c001c2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8c001c3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8c00178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8c00178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #54: + 0x211239 (0x55fe4433c239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8c00178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8c00178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8bc7985189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8bc798c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f8bc79ab978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: frame #55: PyObject_Call + 0x207 (0x55fe4427c067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55fe442622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #57: + 0x150582 (0x55fe4427b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #12: + 0x5adc309 (0x7f8c0016a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55fe442608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #59: + 
0x150582 (0x55fe4427b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #60: PyObject_Call + 0xbc (0x55fe4427bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55fe442622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #62: + 0x150582 (0x55fe4427b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #13: + 0x5ae6f10 (0x7f8c00174f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #63: PyObject_Call + 0xbc (0x55fe4427bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank8]: frame #14: + 0x5ae6fa5 (0x7f8c00174fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #15: + 0x5124446 (0x7f8bff7b2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #16: + 0x1acf4b8 (0x7f8bfc15d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #17: + 0x5aee004 (0x7f8c0017c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #18: + 0x5af36b5 (0x7f8c001816b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #19: + 0xd2631e (0x7f8c12d6b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank8]: frame #20: + 0x47def4 
(0x7f8c124c2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank8]: frame #21: + 0x1445a6 (0x556d57b375a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #22: _PyObject_MakeTpCall + 0x26b (0x556d57b30a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #23: + 0x150866 (0x556d57b43866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x556d57b2c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #25: _PyFunction_Vectorcall + 0x6c (0x556d57b37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #26: PyObject_Call + 0xbc (0x556d57b43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x556d57b2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #28: _PyFunction_Vectorcall + 0x6c (0x556d57b37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x556d57b288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #30: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x556d57b288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #32: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x556d57b288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:[rank8]: frame #34: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x556d57b288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x556d57b2ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #37: _PyObject_Call_Prepend + 0x69 (0x556d57b41c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #38: + 0x211239 (0x556d57c04239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #39: _PyObject_MakeTpCall + 0x26b (0x556d57b30a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x556d57b2c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #41: _PyFunction_Vectorcall + 0x6c (0x556d57b37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x556d57b27c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #43: _PyFunction_Vectorcall + 0x6c (0x556d57b37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x556d57b288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #45: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #46: PyObject_Call + 0xbc (0x556d57b43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x556d57b2a2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #48: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #49: PyObject_Call + 0xbc (0x556d57b43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x556d57b2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #51: _PyFunction_Vectorcall + 0x6c (0x556d57b37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x556d57b30007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #53: _PyObject_Call_Prepend + 0x69 (0x556d57b41c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #54: + 0x211239 (0x556d57c04239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #55: PyObject_Call + 0x207 (0x556d57b44067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x556d57b2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #57: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x556d57b288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #59: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #60: PyObject_Call + 0xbc (0x556d57b43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x556d57b2a2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #62: + 0x150582 (0x556d57b43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #63: PyObject_Call + 0xbc (0x556d57b43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: output = model(**micro_batch) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, 
**kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: sharded_logits = self.model( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank14]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank14]: 
pipeline_state.run_communication() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank14]: recv_activation_tensor = recv_activation() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank14]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank14]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank14]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank14]: dist.recv( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank14]: return func(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank14]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank14]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default6]:[rank14]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): 
[default6]:[rank14]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7fc81df897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank14]: frame #1: + 0x5b3a23e (0x7f8001cfc23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f8001cf6c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8001cf6f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8001cf7fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8001cac371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8001cac371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8001cac371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8001cac371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7fc94b9189 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7fc94c0610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f7fc94df978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #12: + 0x5adc309 (0x7f8001c9e309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #13: + 0x5ae6f10 (0x7f8001ca8f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #14: + 0x5ae6fa5 (0x7f8001ca8fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #15: + 0x5124446 (0x7f80012e6446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #16: + 0x1acf4b8 (0x7f7ffdc914b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #17: + 0x5aee004 (0x7f8001cb0004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #18: + 0x5af36b5 (0x7f8001cb56b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #19: + 0xd2631e (0x7f801489f31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank14]: frame #20: 
+ 0x47def4 (0x7f8013ff6ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank14]: frame #21: + 0x1445a6 (0x5637a34ab5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5637a34a4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #23: + 0x150866 (0x5637a34b7866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5637a34a0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5637a34aba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #26: PyObject_Call + 0xbc (0x5637a34b7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5637a349e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5637a34aba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5637a349c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #30: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5637a349c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #32: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5637a349c8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #34: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5637a349c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5637a34a3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5637a34b5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #38: + 0x211239 (0x5637a3578239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5637a34a4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5637a34a03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5637a34aba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5637a349bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5637a34aba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5637a349c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #45: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #46: PyObject_Call + 0xbc (0x5637a34b7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #47: 
_PyEval_EvalFrameDefault + 0x2d83 (0x5637a349e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #48: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #49: PyObject_Call + 0xbc (0x5637a34b7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5637a349e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5637a34aba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5637a34a4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5637a34b5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #54: + 0x211239 (0x5637a3578239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #55: PyObject_Call + 0x207 (0x5637a34b8067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5637a349e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #57: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5637a349c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #59: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #60: PyObject_Call + 0xbc (0x5637a34b7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: 
frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5637a349e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #62: + 0x150582 (0x5637a34b7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #63: PyObject_Call + 0xbc (0x5637a34b7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub [default5]:Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024). [default5]:07/02/2024 19:52:56 [WARNING|DP=0|PP=3|TP=1|ip-26-0-172-73]: Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub [default5]:07/02/2024 19:52:56 [WARNING|DP=0|PP=3|TP=1|ip-26-0-172-73]: Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024). 
W0702 19:52:56.430000 140024377644864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2561480 closing signal SIGTERM W0702 19:52:56.430000 140024377644864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2561481 closing signal SIGTERM W0702 19:52:56.431000 140024377644864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2561482 closing signal SIGTERM W0702 19:52:56.431000 140024377644864 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2561483 closing signal SIGTERM [default7]:07/02/2024 19:52:56 [WARNING|DP=0|PP=3|TP=3|ip-26-0-172-73]: Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub [default7]:07/02/2024 19:52:56 [WARNING|DP=0|PP=3|TP=3|ip-26-0-172-73]: Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024). [default7]:Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub [default7]:Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024). 
E0702 19:52:57.254000 140024377644864 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2561476) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-02_19:52:56
  host      : ip-26-0-169-139.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2561477)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-02_19:52:56
  host      : ip-26-0-169-139.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2561478)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-02_19:52:56
  host      : ip-26-0-169-139.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2561479)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-02_19:52:56
  host      : ip-26-0-169-139.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2561476)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-169-139: task 0: Exited with exit code 1
[default5]:[rank13]: Traceback (most recent call last):
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in
[default5]:[rank13]: trainer.train(dataloader)
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter(
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]:[rank13]: output = model(**micro_batch)
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank13]: return self._call_impl(*args, **kwargs)
[default5]:[rank13]: File
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank13]: sharded_logits = self.model( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank13]: new_kwargs[name] = recv_from_pipeline_state_buffer( 
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank13]: pipeline_state.run_communication() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank13]: recv_activation_tensor = recv_activation() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank13]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank13]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank13]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank13]: dist.recv( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank13]: return func(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank13]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank13]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value 
store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default5]:[rank13]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank13]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f80476e6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank13]: frame #1: + 0x5b3a23e (0x7f808120323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f80811fdc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f80811fdf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f80811fefd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80811b3371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80811b3371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80811b3371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f80811b3371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f80489c0189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f80489c7610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f80489e6978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #12: + 0x5adc309 (0x7f80811a5309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #13: + 0x5ae6f10 (0x7f80811aff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #14: + 0x5ae6fa5 (0x7f80811affa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #15: + 0x5124446 (0x7f80807ed446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #16: + 0x1acf4b8 (0x7f807d1984b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #17: + 0x5aee004 (0x7f80811b7004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #18: + 0x5af36b5 (0x7f80811bc6b5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #19: + 0xd2631e (0x7f8093da631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank13]: frame #20: + 0x47def4 (0x7f80934fdef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank13]: frame #21: + 0x1445a6 (0x56040c7075a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56040c700a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #23: + 0x150866 (0x56040c713866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56040c6fc142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56040c707a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #26: PyObject_Call + 0xbc (0x56040c713f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56040c6fa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56040c707a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56040c6f88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #30: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56040c6f88fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #32: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56040c6f88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #34: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56040c6f88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56040c6fff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56040c711c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #38: + 0x211239 (0x56040c7d4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56040c700a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56040c6fc3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56040c707a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56040c6f7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56040c707a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56040c6f88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame 
#45: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #46: PyObject_Call + 0xbc (0x56040c713f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56040c6fa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #48: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #49: PyObject_Call + 0xbc (0x56040c713f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56040c6fa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56040c707a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56040c700007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56040c711c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #54: + 0x211239 (0x56040c7d4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #55: PyObject_Call + 0x207 (0x56040c714067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56040c6fa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #57: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56040c6f88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default5]:[rank13]: frame #59: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #60: PyObject_Call + 0xbc (0x56040c713f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56040c6fa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #62: + 0x150582 (0x56040c713582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #63: PyObject_Call + 0xbc (0x56040c713f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:Exception in thread Thread-2 (_pin_memory_loop): [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 1016, in _bootstrap_inner [default5]: self.run() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 953, in run [default5]: self._target(*self._args, **self._kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop [default5]: do_one_step() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step [default5]: r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/queues.py", line 122, in get [default5]: return _ForkingPickler.loads(res) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd 
[default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank15]: sharded_logits = self.model( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) 
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank15]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank15]: pipeline_state.run_communication() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank15]: recv_activation_tensor = recv_activation() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ 
[default7]:[rank15]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank15]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank15]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank15]: dist.recv( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank15]: return func(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank15]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank15]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default7]:[rank15]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank15]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8bad67a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank15]: frame #1: + 0x5b3a23e (0x7f8be719723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f8be7191c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8be7191f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8be7192fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8be7147371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8be7147371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8be7147371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8be7147371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8bae954189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8bae95b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #11: 
c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f8bae97a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #12: + 0x5adc309 (0x7f8be7139309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #13: + 0x5ae6f10 (0x7f8be7143f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #14: + 0x5ae6fa5 (0x7f8be7143fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #15: + 0x5124446 (0x7f8be6781446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #16: + 0x1acf4b8 (0x7f8be312c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #17: + 0x5aee004 (0x7f8be714b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #18: + 0x5af36b5 (0x7f8be71506b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #19: + 0xd2631e (0x7f8bf9d3a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank15]: frame #20: + 0x47def4 (0x7f8bf9491ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank15]: frame #21: + 0x1445a6 (0x56414c4665a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56414c45fa6b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #23: + 0x150866 (0x56414c472866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56414c45b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56414c466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #26: PyObject_Call + 0xbc (0x56414c472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56414c4592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56414c466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56414c4578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #30: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56414c4578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #32: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56414c4578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #34: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56414c4578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #36: 
_PyObject_FastCallDictTstate + 0xd0 (0x56414c45ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56414c470c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #38: + 0x211239 (0x56414c533239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56414c45fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56414c45b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56414c466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56414c456c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56414c466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56414c4578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #45: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #46: PyObject_Call + 0xbc (0x56414c472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56414c4592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #48: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #49: PyObject_Call + 0xbc (0x56414c472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default7]:[rank15]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56414c4592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56414c466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56414c45f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56414c470c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #54: + 0x211239 (0x56414c533239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #55: PyObject_Call + 0x207 (0x56414c473067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56414c4592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #57: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56414c4578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #59: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #60: PyObject_Call + 0xbc (0x56414c472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56414c4592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #62: + 0x150582 (0x56414c472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #63: PyObject_Call + 0xbc (0x56414c472f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank15]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[default7]:Exception in thread Thread-2 (_pin_memory_loop):
[default7]:Traceback (most recent call last):
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[default7]:    self.run()
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 953, in run
[default7]:    self._target(*self._args, **self._kwargs)
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
[default7]:    do_one_step()
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
[default7]:    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/queues.py", line 122, in get
[default7]:    return _ForkingPickler.loads(res)
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
[default7]:    fd = df.detach()
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
[default7]:    with _resource_sharer.get_connection(self._id) as conn:
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
[default7]:    c = Client(address, authkey=process.current_process().authkey)
[default7]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/connection.py", line 508, in Client
W0702 19:53:00.485000 139923676657408 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-172-73.ec2.internal_773582_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 19:53:01.442000 139929343477568 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 773655 closing signal SIGTERM
W0702 19:53:01.442000 139929343477568 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 773657 closing signal SIGTERM
E0702 19:53:02.059000 139929343477568 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 773650) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
W0702 19:53:02.067000 139929343477568 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-172-73.ec2.internal_773582_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 19:53:02.093000 139929343477568 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-172-73.ec2.internal_773582_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0702 19:53:02.115000 139929343477568 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-172-73.ec2.internal_773582_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-02_19:53:01
  host      : ip-26-0-172-73.ec2.internal
  rank      : 9 (local_rank: 1)
  exitcode  : 1 (pid: 773651)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-02_19:53:01
  host      : ip-26-0-172-73.ec2.internal
  rank      : 10 (local_rank: 2)
  exitcode  : 1 (pid: 773652)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-02_19:53:01
  host      : ip-26-0-172-73.ec2.internal
  rank      : 11 (local_rank: 3)
  exitcode  : 1 (pid: 773653)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-02_19:53:01
  host      : ip-26-0-172-73.ec2.internal
  rank      : 12 (local_rank: 4)
  exitcode  : 1 (pid: 773654)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-02_19:53:01
  host      : ip-26-0-172-73.ec2.internal
  rank      : 14 (local_rank: 6)
  exitcode  : 1 (pid: 773656)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-02_19:53:01
  host      : ip-26-0-172-73.ec2.internal
  rank      : 8 (local_rank: 0)
  exitcode  : 1 (pid: 773650)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-172-73: task 1: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.