======================== START TIME: Tue Jul 2 19:58:41 UTC 2024 python3 version = Python 3.10.14 ======================== The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. Token is valid (permission: write). Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token Login successful Already on 'bench_cluster' M examples/config_tiny_llama.py M examples/config_tiny_llama.yaml M examples/train_tiny_llama.sh M src/nanotron/models/llama.py M src/nanotron/trainer.py Your branch is up to date with 'origin/bench_cluster'. Job status: RUNNING W0702 19:58:45.131000 140351111956288 torch/distributed/run.py:757] W0702 19:58:45.131000 140351111956288 torch/distributed/run.py:757] ***************************************** W0702 19:58:45.131000 140351111956288 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 19:58:45.131000 140351111956288 torch/distributed/run.py:757] ***************************************** W0702 19:58:45.764000 140709075269440 torch/distributed/run.py:757] W0702 19:58:45.764000 140709075269440 torch/distributed/run.py:757] ***************************************** W0702 19:58:45.764000 140709075269440 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0702 19:58:45.764000 140709075269440 torch/distributed/run.py:757] ***************************************** [default0]:07/02/2024 19:59:04 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-147]: [Vocab Size Padding] Padded vocab (size: 50257) with 3 dummy tokens (new size: 50260) [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Config: [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: run='%date_%jobid', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: seed=42, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: step=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: consumed_train_samples=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: benchmark_csv_path=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: ignore_sanity_checks=True), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: parallelism=ParallelismArgs(dp=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: pp=4, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tp=4, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: pp_engine=, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tp_mode=, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tp_linear_async_communication=False, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: expert_parallel_size=1), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: eos_token_id=2, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: hidden_act='silu', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: hidden_size=2048, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: initializer_range=0.02, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: intermediate_size=4096, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: is_llama_config=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: max_position_embeddings=4096, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: num_attention_heads=32, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: num_hidden_layers=24, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: num_key_value_heads=32, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: pad_token_id=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: pretraining_tp=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: rms_norm_eps=1e-05, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: rope_scaling=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: rope_theta=10000.0, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tie_word_embeddings=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: use_cache=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: vocab_size=50260), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: init_method=RandomInit(std=0.025), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: dtype=torch.bfloat16, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: make_vocab_size_divisible_by=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: ddp_bucket_cap_mb=25), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tokenizer_revision=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tokenizer_max_length=None), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: checkpoint_interval=100000, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: save_initial_state=False, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: resume_checkpoint_path=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: checkpoints_path_is_shared_file_system=False), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: logging=LoggingArgs(log_level='info', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: log_level_replica='info', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: iteration_step_info_interval=1), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tokens=TokensArgs(sequence_length=4096, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: train_steps=20, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: micro_batch_size=256, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: batch_accumulation_per_replica=4, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: val_check_interval=-1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: limit_val_batches=0, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: limit_test_batches=0), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: adam_beta1=0.9, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: adam_beta2=0.95, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: torch_adam_is_fused=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: name='adamW'), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: zero_stage=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: weight_decay=0.01, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: clip_grad=1.0, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: accumulate_grad_in_fp32=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: lr_warmup_steps=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: lr_warmup_style='linear', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: lr_decay_style='linear', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: lr_decay_steps=19, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: lr_decay_starting_step=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: min_decay_lr=1e-05)), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: start_training_step=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: hf_dataset_splits='train', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: hf_dataset_config_name=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: dataset_processing_num_proc_per_process=64, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: dataset_overwrite_cache=False, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: text_column_name='text'), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: seed=42, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: num_loading_workers=32))], [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/16_GPUS/dp-1_tp-4_pp-4_mbz-256')), [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: lighteval=None) [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Model Config: [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: LlamaConfig(bos_token_id=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: eos_token_id=2, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: hidden_act='silu', [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: hidden_size=2048, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: initializer_range=0.02, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: intermediate_size=4096, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: is_llama_config=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: max_position_embeddings=4096, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: num_attention_heads=32, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: num_hidden_layers=24, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: num_key_value_heads=32, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: pad_token_id=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: pretraining_tp=1, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: rms_norm_eps=1e-05, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: rope_scaling=None, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: rope_theta=10000.0, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: tie_word_embeddings=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: use_cache=True, [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: vocab_size=50260) [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Building model.. [default0]:07/02/2024 19:59:04 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Setting PP block ranks... [default1]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-226]: Local number of parameters: 62.9M (120.05MiB) [default1]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-226]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default5]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-226]: Local number of parameters: 67.7M (129.12MiB) [default5]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-226]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default1]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-226]: No checkpoint path provided. [default5]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-226]: No checkpoint path provided. [default4]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=0|ip-26-0-163-147]: Local number of parameters: 73.4M (140.05MiB) [default4]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default4]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=0|ip-26-0-163-147]: No checkpoint path provided. [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Total number of parameters: 1.21G (2313.42MiB) [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Local number of parameters: 99.2M (189.14MiB) [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: No checkpoint path provided. [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Parametrizing model parameters using StandardParametrizator [default4]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-226]: Local number of parameters: 67.7M (129.12MiB) [default6]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-226]: Local number of parameters: 67.7M (129.12MiB) [default6]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-226]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default3]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-226]: Local number of parameters: 62.9M (120.05MiB) [default4]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-226]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default4]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-226]: No checkpoint path provided. [default6]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-226]: No checkpoint path provided. [default3]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-226]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default3]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-226]: No checkpoint path provided. [default1]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-147]: Local number of parameters: 99.2M (189.14MiB) [default1]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default1]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-147]: No checkpoint path provided. [default2]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=2|ip-26-0-163-147]: Local number of parameters: 99.2M (189.14MiB) [default2]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-226]: Local number of parameters: 62.9M (120.05MiB) [default2]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-226]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default2]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-226]: No checkpoint path provided. [default2]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default3]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=3|ip-26-0-163-147]: Local number of parameters: 99.2M (189.14MiB) [default3]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 197.07MiB. Peak allocated: 199.10MiB Peak reserved: 200.00MiB [default2]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=2|ip-26-0-163-147]: No checkpoint path provided. [default3]:07/02/2024 19:59:18 [INFO|DP=0|PP=0|TP=3|ip-26-0-163-147]: No checkpoint path provided. [default5]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=1|ip-26-0-163-147]: Local number of parameters: 73.4M (140.05MiB) [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-226]: Local number of parameters: 62.9M (120.05MiB) [default5]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default5]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=1|ip-26-0-163-147]: No checkpoint path provided. [default7]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=3|ip-26-0-163-147]: Local number of parameters: 73.4M (140.05MiB) [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-226]: [After model building] Memory usage: 126.06MiB. Peak allocated: 128.09MiB Peak reserved: 130.00MiB [default0]:07/02/2024 19:59:18 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-226]: No checkpoint path provided. [default7]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default7]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=3|ip-26-0-163-147]: No checkpoint path provided. [default7]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-226]: Local number of parameters: 67.7M (129.12MiB) [default7]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-226]: [After model building] Memory usage: 134.05MiB. Peak allocated: 136.08MiB Peak reserved: 138.00MiB [default7]:07/02/2024 19:59:18 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-226]: No checkpoint path provided. [default6]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=2|ip-26-0-163-147]: Local number of parameters: 73.4M (140.05MiB) [default6]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 147.07MiB. Peak allocated: 149.10MiB Peak reserved: 150.00MiB [default6]:07/02/2024 19:59:18 [INFO|DP=0|PP=1|TP=2|ip-26-0-163-147]: No checkpoint path provided. [default0]:07/02/2024 19:59:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/02/2024 19:59:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/02/2024 19:59:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [ZeRO sharding] DP Rank 0 has 99.2M out of 99.2M (100.00%) params' optimizer states [default0]:07/02/2024 19:59:21 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/02/2024 19:59:21 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Using `datasets` library [default0]:07/02/2024 19:59:21 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/02/2024 19:59:21 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 19:59:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [Training Plan] There are 1 training stages [default0]:07/02/2024 19:59:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [Stage Training Stage] start from step 1 [default0]:07/02/2024 19:59:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [default0]:07/02/2024 19:59:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: [Start training] datetime: 2024-07-02 19:59:22.362959 | mbs: 256 | grad_accum: 4 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/02/2024 19:59:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/02/2024 19:59:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-147]: Memory usage: 953.61MiB. Peak allocated 953.61MiB. Peak reserved: 960.00MiB [default4]:07/02/2024 19:59:22 [WARNING|DP=0|PP=1|TP=0|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 19:59:22 [WARNING|DP=0|PP=3|TP=2|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 19:59:22 [WARNING|DP=0|PP=2|TP=2|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/02/2024 19:59:22 [WARNING|DP=0|PP=0|TP=2|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/02/2024 19:59:22 [WARNING|DP=0|PP=2|TP=0|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 19:59:22 [WARNING|DP=0|PP=1|TP=1|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 19:59:22 [WARNING|DP=0|PP=1|TP=3|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/02/2024 19:59:22 [WARNING|DP=0|PP=3|TP=1|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 19:59:22 [WARNING|DP=0|PP=2|TP=1|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/02/2024 19:59:22 [WARNING|DP=0|PP=3|TP=0|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 19:59:22 [WARNING|DP=0|PP=2|TP=3|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/02/2024 19:59:22 [WARNING|DP=0|PP=0|TP=3|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/02/2024 19:59:22 [WARNING|DP=0|PP=1|TP=2|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/02/2024 19:59:22 [WARNING|DP=0|PP=3|TP=3|ip-26-0-163-226]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/02/2024 19:59:23 [WARNING|DP=0|PP=0|TP=1|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank2]: output = model(**micro_batch) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: sharded_logits = self.model( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank2]: output = self.o_proj(attention_output) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank2]: return row_linear( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank2]: out = F.linear(input, weight, bias) [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.96 GiB is free. Including non-PyTorch memory, this process has 77.35 GiB memory in use. Of the allocated memory 67.08 GiB is allocated by PyTorch, and 1.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank1]: output = self.o_proj(attention_output) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank1]: return row_linear( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank1]: out = F.linear(input, weight, bias) [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 2.04 GiB is free. Including non-PyTorch memory, this process has 77.27 GiB memory in use. Of the allocated memory 67.08 GiB is allocated by PyTorch, and 1.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default0]:[rank0]: output = self.o_proj(attention_output) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default0]:[rank0]: return row_linear( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default0]:[rank0]: out = F.linear(input, weight, bias) [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: output = model(**micro_batch) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank3]: output = self.pp_block(**new_kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default3]:[rank3]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank3]: output = self.o_proj(attention_output) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank3]: return row_linear( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default3]:[rank3]: out = F.linear(input, weight, bias) [default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 2.34 GiB is free. Including non-PyTorch memory, this process has 76.98 GiB memory in use. Of the allocated memory 67.08 GiB is allocated by PyTorch, and 1.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: F[default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default3]:[rank11]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: output = model(**micro_batch) [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank9]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank9]: pipeline_state.run_communication() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank9]: recv_activation_tensor = recv_activation() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank9]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank9]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank9]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank9]: dist.recv( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank9]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank11]: sharded_logits = self.model( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default1]:[rank9]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank9]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f16e83c5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank9]: frame #1: + 0x5b3a23e (0x7f1721ee223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1721edcc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1721edcf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1721eddfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1721e92371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1721e92371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1721e92371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1721e92371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f16e969f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: return forward_call(*args, **kwargs) [default1]:[rank9]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f16e96a6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f16e96c5978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank9]: frame #12: + 0x5adc309 (0x7f1721e84309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #13: + 0x5ae6f10 (0x7f1721e8ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank11]: pipeline_state.run_communication() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: frame #14: + 0x5ae6fa5 (0x7f1721e8efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: recv_activation_tensor = recv_activation() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: frame #15: + 0x5124446 (0x7f17214cc446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank10]: output = model(**micro_batch) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: frame #16: + 0x1acf4b8 (0x7f171de774b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #17: + 0x5aee004 (0x7f1721e96004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default1]:[rank9]: frame #18: + 0x5af36b5 (0x7f1721e9b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank11]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: sharded_logits = self.model( [default1]:[rank9]: frame #19: + 0xd2631e (0x7f1734a8531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank9]: frame #20: + 0x47def4 (0x7f17341dcef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: frame #21: + 0x1445a6 (0x5615227375a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank11]: dist.recv( [default1]:[rank9]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561522730a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: frame #23: + 0x150866 (0x561522743866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56152272c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561522737a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #26: PyObject_Call + 0xbc (0x561522743f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default3]:[rank11]: return func(*args, **kwargs) [default1]:[rank9]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56152272a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: trainer.train(dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank9]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561522737a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank10]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank11]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank9]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5615227288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #30: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: pipeline_state.run_communication() [default3]:[rank11]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank11]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f3e445897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank11]: frame #1: + 0x5b3a23e (0x7f3f77f6223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: recv_activation_tensor = recv_activation() [default1]:[rank9]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5615227288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: frame #32: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5615227288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f3f77f5cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3f77f5cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f3f77f5dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: output = model(**micro_batch) [default1]:[rank9]: frame #34: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5615227288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3f77f12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank10]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank9]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56152272ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank11]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3f77f12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3f77f12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3f77f12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561522741c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank10]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank9]: frame #38: + 0x211239 (0x561522804239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561522730a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f3f3f71f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: dist.recv( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank11]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f3f3f726610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f3f3f745978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: return func(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank11]: frame #12: + 0x5adc309 (0x7f3f77f04309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56152272c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank9]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561522737a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default2]:[rank10]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa77179d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank10]: frame #1: + 0x5b3a23e (0x7fa7ab2ba23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #13: + 0x5ae6f10 (0x7f3f77f0ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa7ab2b4c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa7ab2b4f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561522727c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561522737a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa7ab2b5fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5615227288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7ab26a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #14: + 0x5ae6fa5 (0x7f3f77f0efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #15: + 0x5124446 (0x7f3f7754c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7ab26a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: frame #45: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7ab26a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #46: PyObject_Call + 0xbc (0x561522743f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa7ab26a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56152272a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #16: + 0x1acf4b8 (0x7f3f73ef74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #17: + 0x5aee004 (0x7f3f77f16004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa772a77189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa772a7e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa772a9d978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank11]: frame #18: + 0x5af36b5 (0x7f3f77f1b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #48: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #12: + 0x5adc309 (0x7fa7ab25c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank9]: frame #49: PyObject_Call + 0xbc (0x561522743f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank8]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank10]: frame #13: + 0x5ae6f10 (0x7fa7ab266f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ile "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-pack[default3]:[rank11]: frame #19: + 0xd2631e (0x7f3f8ab0531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #20: + 0x47def4 (0x7f3f8a25cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank11]: frame #21: + 0x1445a6 (0x55a48f2225a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank10]: frame #14: + 0x5ae6fa5 (0x7fa7ab266fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank6]: pipeline_state.run_communication() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank6]: recv_activation_tensor = recv_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: buffers, futures = self.irecv_tensors(num_t[default3]:[rank11]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a48f21ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank6]: dist.recv( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank8]: pipeline_state.run_communication() [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank6]: return func(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank6]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5f0839897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [defaul[default1]:[rank9]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56152272a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561522737a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #23: + 0x150866 (0x55a48f22e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) t6]:[rank6]: frame #1: + 0x5b3a23e (0x7fa62a35623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa62a350c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa62a350f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa62a351fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa62a306371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.1[default2]:[rank10]: frame #15: + 0x5124446 (0x7fa7aa8a4446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #16: + 0x1acf4b8 (0x7fa7a724f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 0/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa62a306371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa62a306371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa62a306371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa5f1b13189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c1[default3]:[rank11]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a48f217142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a48f222a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 0d::OpType, int, bool) + 0xc50 (0x7fa5f1b1a610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa5f1b39978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank6]: frame #12: + 0x5adc309 (0x7fa62a2f8309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #13: + 0x5ae6f10 (0x7fa62a302f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #17: + 0x5aee004 (0x7fa7ab26e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #14: + 0x5ae6fa5 (0x7fa62a302fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #15: + 0x5124446 (0x7fa629940446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #16: + 0x1acf4b8 (0x7fa6262eb4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #17: + 0x5aee004 (0x7fa62a30a004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #18: + 0x5af36b5 (0x7fa62a30f6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank9]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561522730007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #19: + 0xd2631e (0x7fa63cef931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank6]: frame #20: + 0x47def4 (0x7fa63c650ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank6]: frame #21: + 0x1445a6 (0x564b840fb5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561522741c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x564b840f4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #23: + 0x150866 (0x564b84107866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x564b840f0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x564b840fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: recv_activation_tensor = recv_activation() [default3]:[rank11]: frame #26: PyObject_Call + 0xbc (0x55a48f22ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x564b84107f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x564b840ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x564b840fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x564b840ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #30: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x564b840ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #32: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/b[default3]:[rank11]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a48f2152b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) in/python3.10) [default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x564b840ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #34: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x564b840ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x564b840f3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x564b84105c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #38: + 0x211239 (0x564b841c8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x564b840f4a6b in /fsx/ferdinandmom/miniforge3[default3]:[rank11]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a48f222a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) /envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x564b840f03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x564b840fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank9]: frame #54: + 0x211239 (0x561522804239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a48f2138fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x564b840ebc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x564b840fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x564b840ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #45: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x564b84107f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #18: + 0x5af36b5 (0x7fa7ab2736b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x564b840ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #48: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x564b84107f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank11]: frame #30: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #55: PyObject_Call + 0x207 (0x561522744067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x564b840ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x564b840fba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x564b840f4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x564b84105c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: frame #54: + 0x211239 (0x564b841c8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x564b84108067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x564b840ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a48f2138fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56152272a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #57: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x564b840ec8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #59: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank10]: frame #19: + 0xd2631e (0x7fa7bde5d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank10]: frame #20: + 0x47def4 (0x7fa7bd5b4ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x564b84107f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x564b840ee2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #62: + 0x150582 (0x564b84107582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #21: + 0x1445a6 (0x561d4d0205a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x564b84107f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank11]: frame #32: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a48f2138fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: frame #57: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #34: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank8]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank[default3]:[rank11]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a48f2138fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a48f21af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank11]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a48f22cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank11]: frame #38: + 0x211239 (0x55a48f2ef239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank4]: pipeline_state.run_communication() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank4]: recv_activation_tensor = recv_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank4]: buffers, futures = self.irecv_tensors(num_t[default0]:[rank8]: dist.recv( ensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank9]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5615227288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #59: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank4]: dist.recv( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank4]: return func(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank8]: return func(*args, **kwargs) [default4]:[rank4]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7e42795897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank10]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561d4d019a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a48f21ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #1: + 0x5b3a23e (0x7f7e7c2b223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a48f2173e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f7e7c2acc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7e7c2acf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a48f222a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7e7c2adfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #23: + 0x150866 (0x561d4d02c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7e7c262371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7e7c262371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7e7c262371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7e7c262371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank11]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a48f212c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7e43a6f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7e43a76610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f7e43a95978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: frame #12: + 0x5adc309 (0x7f7e7c254309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #13: + 0x5ae6f10 (0x7f7e7c25ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank4]: frame #14: + 0x5ae6fa5 (0x7f7e7c25efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #15: + 0x5124446 (0x7f7e7b89c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #16: + 0x1acf4b8 (0x7f7e782474b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default4]:[rank4]: frame #17: + 0x5aee004 (0x7f7e7c266004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #18: + 0x5af36b5 (0x7f7e7c26b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #19: + 0xd2631e (0x7f7e8ee5531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank8]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank4]: frame #20: + 0x47def4 (0x7f7e8e5acef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank4]: frame #21: + 0x1445a6 (0x5567534b95a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5567534b2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6f4ef1d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank8]: frame #1: + 0x5b3a23e (0x7f6f88a3a23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #23: + 0x150866 (0x5567534c5866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5567534ae142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5567534b9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x5567534c5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f6f88a34c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5567534ac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5567534b9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5567534aa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #30: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5567534aa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6f88a34f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6f88a35fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #32: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5567534aa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #34: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5567534aa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f889ea371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f889ea371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f889ea371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5567534b1f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5567534c3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #38: + 0x211239 (0x556753586239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f889ea371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6f501f7189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5567534b2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5567534ae3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5567534b9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5567534a9c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f6f501fe610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f6f5021d978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank8]: frame #12: + 0x5adc309 (0x7f6f889dc309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5567534b9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5567534aa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #45: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #13: + 0x5ae6f10 (0x7f6f889e6f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x5567534c5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5567534ac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #48: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x5567534c5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561d4d015142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5567534ac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5567534b9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5567534b2007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5567534c3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561d4d020a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #54: + 0x211239 (0x556753586239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x5567534c6067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5567534ac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #57: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #26: PyObject_Call + 0xbc (0x561d4d02cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5567534aa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #59: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x5567534c5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561d4d0132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5567534ac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #62: + 0x150582 (0x5567534c5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x5567534c5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561d4d020a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank10]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561d4d0118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/na[default2]:[rank10]: frame #30: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) notron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561d4d0118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/[default2]:[rank10]: frame #32: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipelin[default2]:[rank10]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561d4d0118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) e_state_buffer [default5]:[rank5]: pipeline_state.run_communication() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank5]: recv_activation_tensor = recv_activation() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank10]: frame #34: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank5]: dist.recv( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank5]: return func(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages[default2]:[rank10]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561d4d0118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) /torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank10]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561d4d018f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc7eae67897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank5]: frame #1: + 0x5b3a23e (0x7fc82498423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fc82497ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [defa[default2]:[rank10]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561d4d02ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ult5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc82497ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc82497ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #38: + 0x211239 (0x561d4d0ed239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc824934371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc824934371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc824934371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc824934371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc7ec141189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-clus[default2]:[rank10]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561d4d019a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ter/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fc7ec148610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fc7ec167978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank10]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561d4d0153e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #12: + 0x5adc309 (0x7fc824926309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #13: + 0x5ae6f10 (0x7fc824930f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561d4d020a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #14: + 0x5ae6fa5 (0x7fc824930fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #15: + 0x5124446 (0x7fc823f6e446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #16: + 0x1acf4b8 (0x7fc8209194b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank10]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561d4d010c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #17: + 0x5aee004 (0x7fc824938004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #18: + 0x5af36b5 (0x7fc82493d6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #19: + 0xd2631e (0x7fc83752731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank10]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561d4d020a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #20: + 0x47def4 (0x7fc836c7eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank5]: frame #21: + 0x1445a6 (0x555900e1c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555900e15a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561d4d0118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #23: + 0x150866 (0x555900e28866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555900e11142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555900e1ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #45: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x555900e28f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555900e0f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555900e1ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555900e0d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #46: PyObject_Call + 0xbc (0x561d4d02cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #30: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555900e0d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #32: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555900e0d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561d4d0132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #14: + 0x5ae6fa5 (0x7f6f889e6fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #15: + 0x5124446 (0x7f6f88024446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #16: + 0x1acf4b8 (0x7f6f849cf4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #34: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555900e0d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555900e14f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #17: + 0x5aee004 (0x7f6f889ee004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555900e26c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #38: + 0x211239 (0x555900ee9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555900e15a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555900e113e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #18: + 0x5af36b5 (0x7f6f889f36b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555900e1ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555900e0cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555900e1ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #48: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555900e0d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #45: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x555900e28f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555900e0f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #49: PyObject_Call + 0xbc (0x561d4d02cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #19: + 0xd2631e (0x7f6f9b5dd31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank5]: frame #48: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x555900e28f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555900e0f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #20: + 0x47def4 (0x7f6f9ad34ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555900e1ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555900e15007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555900e26c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #54: + 0x211239 (0x555900ee9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561d4d0132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561d4d020a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x555900e29067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555900e0f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #57: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555900e0d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #21: + 0x1445a6 (0x5648b9f655a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #59: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x555900e28f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555900e0f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561d4d019007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561d4d02ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #62: + 0x150582 (0x555900e28582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x555900e28f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank10]: frame #54: + 0x211239 (0x561d4d0ed239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/na[default2]:[rank10]: frame #55: PyObject_Call + 0x207 (0x561d4d02d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5648b9f5ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) notron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561d4d0132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #57: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561d4d0118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7[default2]:[rank10]: frame #59: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #60: PyObject_Call + 0xbc (0x561d4d02cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank7]: pipeline_state.run_communication() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank7]: recv_activation_tensor = recv_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_p[default2]:[rank10]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561d4d0132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank10]: frame #62: + 0x150582 (0x561d4d02c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) arallel/state.py", line 31, in __call__ [default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank7]: dist.recv( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank7]: [default2]:[rank10]: frame #63: PyObject_Call + 0xbc (0x561d4d02cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank8]: frame #23: + 0x150866 (0x5648b9f71866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1485327897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank7]: frame #1: + 0x5b3a23e (0x7f14bee4423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f14bee3ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench[default2]:[rank10]: . This may indicate a possible application crash on rank 0 or a network set up issue. -cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f14bee3ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f14bee3ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f14bedf4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f14bedf4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5648b9f5a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5648b9f65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f14bedf4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f14bedf4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1486601189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f1486608610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f1486627978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank7]: frame #12: + 0x5adc309 (0x7f14bede6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5648b9f582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5648b9f65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5648b9f568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #13: + 0x5ae6f10 (0x7f14bedf0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #30: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5648b9f568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #32: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #14: + 0x5ae6fa5 (0x7f14bedf0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank8]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5648b9f568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #34: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #15: + 0x5124446 (0x7f14be42e446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #16: + 0x1acf4b8 (0x7f14badd94b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #17: + 0x5aee004 (0x7f14bedf8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #18: + 0x5af36b5 (0x7f14bedfd6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank7]: frame #19: + 0xd2631e (0x7f14d19e731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank8]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5648b9f568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5648b9f5df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #20: + 0x47def4 (0x7f14d113eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank7]: frame #21: + 0x1445a6 (0x56047d8d35a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56047d8cca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5648b9f6fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #38: + 0x211239 (0x5648ba032239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5648b9f5ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #23: + 0x150866 (0x56047d8df866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56047d8c8142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56047d8d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x56047d8dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56047d8c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5648b9f5a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5648b9f65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5648b9f55c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56047d8d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56047d8c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #30: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56047d8c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5648b9f65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5648b9f568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #32: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56047d8c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #34: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56047d8c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #45: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #46: PyObject_Call + 0xbc (0x5648b9f71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5648b9f582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56047d8cbf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56047d8ddc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #38: + 0x211239 (0x56047d9a0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56047d8cca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56047d8c83e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #48: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #49: PyObject_Call + 0xbc (0x5648b9f71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56047d8d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56047d8c3c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56047d8d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56047d8c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5648b9f582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5648b9f65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5648b9f5e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #45: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x56047d8dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56047d8c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #48: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5648b9f6fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #54: + 0x211239 (0x5648ba032239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x56047d8dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56047d8c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56047d8d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56047d8cc007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #55: PyObject_Call + 0x207 (0x5648b9f72067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5648b9f582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #57: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56047d8ddc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #54: + 0x211239 (0x56047d9a0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x56047d8e0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56047d8c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #57: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5648b9f568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #59: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #60: PyObject_Call + 0xbc (0x5648b9f71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56047d8c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #59: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x56047d8dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56047d8c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #62: + 0x150582 (0x56047d8df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5648b9f582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: frame #62: + 0x150582 (0x5648b9f71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x56047d8dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank8]: frame #63: PyObject_Call + 0xbc (0x5648b9f71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank8]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: output = model(**micro_batch) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: sharded_logits = self.model( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank14]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank14]: pipeline_state.run_communication() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank14]: recv_activation_tensor = recv_activation() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank14]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank14]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank14]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank14]: dist.recv( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank14]: return func(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank14]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank14]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default6]:[rank14]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank14]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9c9e082897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank14]: frame #1: + 0x5b3a23e (0x7f9cd7b9f23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f9cd7b99c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9cd7b99f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9cd7b9afd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9cd7b4f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9cd7b4f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9cd7b4f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9cd7b4f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9c9f35c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f9c9f363610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f9c9f382978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank14]: frame #12: + 0x5adc309 (0x7f9cd7b41309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #13: + 0x5ae6f10 (0x7f9cd7b4bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #14: + 0x5ae6fa5 (0x7f9cd7b4bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #15: + 0x5124446 (0x7f9cd7189446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #16: + 0x1acf4b8 (0x7f9cd3b344b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #17: + 0x5aee004 (0x7f9cd7b53004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #18: + 0x5af36b5 (0x7f9cd7b586b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank14]: frame #19: + 0xd2631e (0x7f9cea74231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank14]: frame #20: + 0x47def4 (0x7f9ce9e99ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank14]: frame #21: + 0x1445a6 (0x55a1aae7a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a1aae73a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #23: + 0x150866 (0x55a1aae86866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a1aae6f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a1aae7aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #26: PyObject_Call + 0xbc (0x55a1aae86f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1aae6d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a1aae7aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a1aae6b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #30: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a1aae6b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #32: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a1aae6b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #34: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a1aae6b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a1aae72f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a1aae84c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #38: + 0x211239 (0x55a1aaf47239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a1aae73a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a1aae6f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a1aae7aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a1aae6ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a1aae7aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a1aae6b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #45: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #46: PyObject_Call + 0xbc (0x55a1aae86f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1aae6d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #48: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #49: PyObject_Call + 0xbc (0x55a1aae86f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1aae6d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a1aae7aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a1aae73007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a1aae84c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #54: + 0x211239 (0x55a1aaf47239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #55: PyObject_Call + 0x207 (0x55a1aae87067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1aae6d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #57: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a1aae6b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #59: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #60: PyObject_Call + 0xbc (0x55a1aae86f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1aae6d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #62: + 0x150582 (0x55a1aae86582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: frame #63: PyObject_Call + 0xbc (0x55a1aae86f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank14]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank9]: frame #60: PyObject_Call + 0xbc (0x561522743f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56152272a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #62: + 0x150582 (0x561522743582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: frame #63: PyObject_Call + 0xbc (0x561522743f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank9]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank11]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a48f222a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a48f2138fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #45: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #46: PyObject_Call + 0xbc (0x55a48f22ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a48f2152b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #48: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #49: PyObject_Call + 0xbc (0x55a48f22ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a48f2152b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a48f222a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a48f21b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a48f22cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #54: + 0x211239 (0x55a48f2ef239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #55: PyObject_Call + 0x207 (0x55a48f22f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a48f2152b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #57: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a48f2138fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #59: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #60: PyObject_Call + 0xbc (0x55a48f22ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a48f2152b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #62: + 0x150582 (0x55a48f22e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: frame #63: PyObject_Call + 0xbc (0x55a48f22ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank11]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank13]: Traceback (most recent call last): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: output = model(**micro_batch) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank13]: sharded_logits = self.model( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank13]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank13]: pipeline_state.run_communication() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank13]: recv_activation_tensor = recv_activation() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank13]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank13]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank13]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank13]: dist.recv( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank13]: return func(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank13]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank13]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default5]:[rank13]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first): [default5]:[rank13]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb5c1b07897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank13]: frame #1: + 0x5b3a30c (0x7fb5fb62430c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fb5fb61ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb5fb61ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb5fb61ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5fb5d4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5fb5d4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5fb5d4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb5fb5d4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb5c2de1189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fb5c2de8610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fb5c2e07978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank13]: frame #12: + 0x5adc309 (0x7fb5fb5c6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #13: + 0x5ae6f10 (0x7fb5fb5d0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #14: + 0x5ae6fa5 (0x7fb5fb5d0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #15: + 0x5124446 (0x7fb5fac0e446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #16: + 0x1acf4b8 (0x7fb5f75b94b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #17: + 0x5aee004 (0x7fb5fb5d8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #18: + 0x5af36b5 (0x7fb5fb5dd6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank13]: frame #19: + 0xd2631e (0x7fb60e1c731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank13]: frame #20: + 0x47def4 (0x7fb60d91eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank13]: frame #21: + 0x1445a6 (0x5624d3e725a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5624d3e6ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #23: + 0x150866 (0x5624d3e7e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5624d3e67142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5624d3e72a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #26: PyObject_Call + 0xbc (0x5624d3e7ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5624d3e652b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5624d3e72a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5624d3e638fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #30: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5624d3e638fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #32: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5624d3e638fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #34: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5624d3e638fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5624d3e6af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5624d3e7cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #38: + 0x211239 (0x5624d3f3f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5624d3e6ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5624d3e673e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5624d3e72a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5624d3e62c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5624d3e72a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5624d3e638fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #45: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #46: PyObject_Call + 0xbc (0x5624d3e7ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5624d3e652b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #48: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #49: PyObject_Call + 0xbc (0x5624d3e7ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5624d3e652b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5624d3e72a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5624d3e6b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5624d3e7cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #54: + 0x211239 (0x5624d3f3f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #55: PyObject_Call + 0x207 (0x5624d3e7f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5624d3e652b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #57: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5624d3e638fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #59: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #60: PyObject_Call + 0xbc (0x5624d3e7ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5624d3e652b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #62: + 0x150582 (0x5624d3e7e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: frame #63: PyObject_Call + 0xbc (0x5624d3e7ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank13]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank15]: sharded_logits = self.model( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank15]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank15]: pipeline_state.run_communication() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank15]: recv_activation_tensor = recv_activation() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank15]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank15]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank15]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank15]: dist.recv( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank15]: return func(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank15]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank15]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default7]:[rank15]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank15]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff33a1f5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank15]: frame #1: + 0x5b3a23e (0x7ff373d1223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7ff373d0cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff373d0cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff373d0dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff373cc2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff373cc2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff373cc2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff373cc2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff33b4cf189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7ff33b4d6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7ff33b4f5978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank15]: frame #12: + 0x5adc309 (0x7ff373cb4309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #13: + 0x5ae6f10 (0x7ff373cbef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #14: + 0x5ae6fa5 (0x7ff373cbefa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #15: + 0x5124446 (0x7ff3732fc446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #16: + 0x1acf4b8 (0x7ff36fca74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #17: + 0x5aee004 (0x7ff373cc6004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #18: + 0x5af36b5 (0x7ff373ccb6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank15]: frame #19: + 0xd2631e (0x7ff3868b531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank15]: frame #20: + 0x47def4 (0x7ff38600cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank15]: frame #21: + 0x1445a6 (0x559253d275a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #22: _PyObject_MakeTpCall + 0x26b (0x559253d20a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #23: + 0x150866 (0x559253d33866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559253d1c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #25: _PyFunction_Vectorcall + 0x6c (0x559253d27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #26: PyObject_Call + 0xbc (0x559253d33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559253d1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #28: _PyFunction_Vectorcall + 0x6c (0x559253d27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559253d188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #30: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559253d188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #32: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559253d188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #34: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x559253d188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559253d1ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #37: _PyObject_Call_Prepend + 0x69 (0x559253d31c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #38: + 0x211239 (0x559253df4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #39: _PyObject_MakeTpCall + 0x26b (0x559253d20a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559253d1c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #41: _PyFunction_Vectorcall + 0x6c (0x559253d27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559253d17c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #43: _PyFunction_Vectorcall + 0x6c (0x559253d27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559253d188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #45: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #46: PyObject_Call + 0xbc (0x559253d33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559253d1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #48: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #49: PyObject_Call + 0xbc (0x559253d33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559253d1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #51: _PyFunction_Vectorcall + 0x6c (0x559253d27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559253d20007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #53: _PyObject_Call_Prepend + 0x69 (0x559253d31c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #54: + 0x211239 (0x559253df4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #55: PyObject_Call + 0x207 (0x559253d34067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559253d1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #57: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559253d188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #59: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #60: PyObject_Call + 0xbc (0x559253d33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559253d1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #62: + 0x150582 (0x559253d33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: frame #63: PyObject_Call + 0xbc (0x559253d33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank15]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:Exception in thread Thread-2 (_pin_memory_loop): [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 1016, in _bootstrap_inner [default7]: self.run() [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 953, in run [default7]: self._target(*self._args, **self._kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop [default7]: do_one_step() [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step [default7]: r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/queues.py", line 122, in get [default5]:Exception in thread Thread-2 (_pin_memory_loop): [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 1016, in _bootstrap_inner [default5]: self.run() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 953, in run [default5]: self._target(*self._args, **self._kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop [default5]: do_one_step() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step [default5]: r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/queues.py", line 122, in get [default7]: return _ForkingPickler.loads(res) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd [default7]: fd = df.detach() [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach [default7]: return reduction.recv_handle(conn) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle [default7]: return recvfds(s, 1)[0] [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/multiprocessing/reduction.py", line 157, in recvfds [default7]: msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(bytes_size)) [default7]:ConnectionResetError: [Errno 104] Connection reset by peer [default4]:[rank12]: Traceback (most recent call last): [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: trainer.train(dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank12]: output = model(**micro_batch) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: sharded_logits = self.model( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank12]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank12]: pipeline_state.run_communication() [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank12]: recv_activation_tensor = recv_activation() [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank12]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank12]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank12]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank12]: dist.recv( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank12]: return func(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank12]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank12]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default4]:[rank12]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank12]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff569c3c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank12]: frame #1: + 0x5b3a23e (0x7ff5a375923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7ff5a3753c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff5a3753f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff5a3754fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff5a3709371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff5a3709371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff5a3709371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff5a3709371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff56af16189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank12]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7ff56af1d610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank12]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7ff56af3c978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank12]: frame #12: + 0x5adc309 (0x7ff5a36fb309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #13: + 0x5ae6f10 (0x7ff5a3705f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #14: + 0x5ae6fa5 (0x7ff5a3705fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #15: + 0x5124446 (0x7ff5a2d43446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #16: + 0x1acf4b8 (0x7ff59f6ee4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #17: + 0x5aee004 (0x7ff5a370d004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #18: + 0x5af36b5 (0x7ff5a37126b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank12]: frame #19: + 0xd2631e (0x7ff5b62fc31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank12]: frame #20: + 0x47def4 (0x7ff5b5a53ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank12]: frame #21: + 0x1445a6 (0x5614103635a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56141035ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #23: + 0x150866 (0x56141036f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561410358142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561410363a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #26: PyObject_Call + 0xbc (0x56141036ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5614103562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561410363a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5614103548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #30: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5614103548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #32: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5614103548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #34: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5614103548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56141035bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56141036dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #38: + 0x211239 (0x561410430239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56141035ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5614103583e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561410363a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561410353c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561410363a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5614103548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #45: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #46: PyObject_Call + 0xbc (0x56141036ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5614103562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #48: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #49: PyObject_Call + 0xbc (0x56141036ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5614103562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561410363a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56141035c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56141036dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #54: + 0x211239 (0x561410430239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #55: PyObject_Call + 0x207 (0x561410370067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5614103562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #57: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5614103548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #59: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #60: PyObject_Call + 0xbc (0x56141036ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5614103562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #62: + 0x150582 (0x56141036f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: frame #63: PyObject_Call + 0xbc (0x56141036ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank12]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:Exception in thread Thread-2 (_pin_memory_loop): [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/threading.py", line 1016, in _bootstrap_inner W0702 19:59:31.430000 140351111956288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 694444 closing signal SIGTERM W0702 19:59:31.431000 140351111956288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 694445 closing signal SIGTERM W0702 19:59:31.431000 140351111956288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 694446 closing signal SIGTERM W0702 19:59:31.432000 140351111956288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 694447 closing signal SIGTERM E0702 19:59:32.654000 140351111956288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 694440) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_19:59:31 host : ip-26-0-163-147.ec2.internal rank : 1 (local_rank: 1) exitcode : 1 (pid: 694441) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-02_19:59:31 host : ip-26-0-163-147.ec2.internal rank : 2 (local_rank: 2) exitcode : 1 (pid: 694442) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-02_19:59:31 host : ip-26-0-163-147.ec2.internal rank : 3 (local_rank: 3) exitcode : 1 (pid: 694443) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_19:59:31 host : ip-26-0-163-147.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 694440) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-163-147: task 0: Exited with exit code 1 W0702 19:59:35.987000 140703408449280 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-163-226.ec2.internal_3100356_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. E0702 19:59:36.558000 140709075269440 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3100425) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0702 19:59:36.564000 140709075269440 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-226.ec2.internal_3100356_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 19:59:36.592000 140709075269440 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-226.ec2.internal_3100356_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0702 19:59:36.623000 140709075269440 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-226.ec2.internal_3100356_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 9 (local_rank: 1) exitcode : 1 (pid: 3100426) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 10 (local_rank: 2) exitcode : 1 (pid: 3100427) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 11 (local_rank: 3) exitcode : 1 (pid: 3100428) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 12 (local_rank: 4) exitcode : 1 (pid: 3100429) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 13 (local_rank: 5) exitcode : 1 (pid: 3100430) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 14 (local_rank: 6) exitcode : 1 (pid: 3100431) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 15 (local_rank: 7) exitcode : 1 (pid: 3100432) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-02_19:59:36 host : ip-26-0-163-226.ec2.internal rank : 8 (local_rank: 0) exitcode : 1 (pid: 3100425) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-163-226: task 1: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.