======================== START TIME: Wed Jul 3 03:17:36 UTC 2024 python3 version = Python 3.10.14 ======================== The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. Token is valid (permission: write). Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token Login successful Already on 'bench_cluster' M examples/config_tiny_llama.py M examples/config_tiny_llama.yaml M examples/train_tiny_llama.sh M src/nanotron/models/llama.py M src/nanotron/trainer.py Your branch is up to date with 'origin/bench_cluster'. Job status: RUNNING W0703 03:17:39.739000 140703858952000 torch/distributed/run.py:757] W0703 03:17:39.739000 140703858952000 torch/distributed/run.py:757] ***************************************** W0703 03:17:39.739000 140703858952000 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:39.739000 140703858952000 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.851000 140665481819968 torch/distributed/run.py:757] W0703 03:17:42.851000 140665481819968 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.851000 140665481819968 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:42.851000 140665481819968 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.857000 140598708508480 torch/distributed/run.py:757] W0703 03:17:42.857000 140598708508480 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.857000 140598708508480 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:42.857000 140598708508480 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.865000 140672604792640 torch/distributed/run.py:757] W0703 03:17:42.865000 140672604792640 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.865000 140672604792640 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:42.865000 140672604792640 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.867000 139993479001920 torch/distributed/run.py:757] W0703 03:17:42.867000 139993479001920 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.867000 139993479001920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:42.867000 139993479001920 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.888000 139850509760320 torch/distributed/run.py:757] W0703 03:17:42.888000 139850509760320 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.888000 139850509760320 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:42.888000 139850509760320 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.981000 140661720811328 torch/distributed/run.py:757] W0703 03:17:42.981000 140661720811328 torch/distributed/run.py:757] ***************************************** W0703 03:17:42.981000 140661720811328 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:42.981000 140661720811328 torch/distributed/run.py:757] ***************************************** W0703 03:17:43.280000 139967333427008 torch/distributed/run.py:757] W0703 03:17:43.280000 139967333427008 torch/distributed/run.py:757] ***************************************** W0703 03:17:43.280000 139967333427008 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 03:17:43.280000 139967333427008 torch/distributed/run.py:757] ***************************************** [default0]:07/03/2024 03:18:07 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-103]: [Vocab Size Padding] Padded vocab (size: 50257) with 15 dummy tokens (new size: 50272) [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Config: [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: run='%date_%jobid', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: seed=42, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: step=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: consumed_train_samples=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: benchmark_csv_path=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: ignore_sanity_checks=True), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: parallelism=ParallelismArgs(dp=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pp=4, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tp=16, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pp_engine=, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tp_mode=, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tp_linear_async_communication=False, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: expert_parallel_size=1), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: eos_token_id=2, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_act='silu', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_size=2048, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: initializer_range=0.02, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: intermediate_size=4096, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: is_llama_config=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: max_position_embeddings=4096, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_attention_heads=32, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_hidden_layers=24, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_key_value_heads=32, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pad_token_id=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pretraining_tp=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rms_norm_eps=1e-05, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_scaling=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_theta=10000.0, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tie_word_embeddings=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: use_cache=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: vocab_size=50272), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: init_method=RandomInit(std=0.025), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: dtype=torch.bfloat16, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: make_vocab_size_divisible_by=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: ddp_bucket_cap_mb=25), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokenizer_revision=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokenizer_max_length=None), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: checkpoint_interval=100000, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: save_initial_state=False, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: resume_checkpoint_path=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: checkpoints_path_is_shared_file_system=False), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: logging=LoggingArgs(log_level='info', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: log_level_replica='info', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: iteration_step_info_interval=1), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokens=TokensArgs(sequence_length=4096, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: train_steps=20, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: micro_batch_size=256, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: batch_accumulation_per_replica=4, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: val_check_interval=-1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: limit_val_batches=0, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: limit_test_batches=0), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: adam_beta1=0.9, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: adam_beta2=0.95, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: torch_adam_is_fused=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: name='adamW'), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: zero_stage=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: weight_decay=0.01, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: clip_grad=1.0, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: accumulate_grad_in_fp32=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_warmup_steps=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_warmup_style='linear', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_decay_style='linear', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_decay_steps=19, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_decay_starting_step=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: min_decay_lr=1e-05)), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: start_training_step=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hf_dataset_splits='train', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hf_dataset_config_name=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: dataset_processing_num_proc_per_process=64, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: dataset_overwrite_cache=False, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: text_column_name='text'), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: seed=42, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_loading_workers=0))], [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-16_pp-4_mbz-256')), [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lighteval=None) [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Model Config: [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: LlamaConfig(bos_token_id=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: eos_token_id=2, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_act='silu', [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_size=2048, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: initializer_range=0.02, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: intermediate_size=4096, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: is_llama_config=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: max_position_embeddings=4096, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_attention_heads=32, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_hidden_layers=24, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_key_value_heads=32, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pad_token_id=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pretraining_tp=1, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rms_norm_eps=1e-05, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_scaling=None, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_theta=10000.0, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tie_word_embeddings=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: use_cache=True, [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: vocab_size=50272) [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Building model.. [default0]:07/03/2024 03:18:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Setting PP block ranks... [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=8|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=8|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=8|ip-26-0-161-78]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=15|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=15|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=15|ip-26-0-161-78]: No checkpoint path provided. [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=11|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=11|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=11|ip-26-0-161-78]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=9|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=9|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=9|ip-26-0-161-78]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=12|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=8|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=8|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=12|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=8|ip-26-0-171-88]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=12|ip-26-0-171-88]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=2|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=2|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=2|ip-26-0-171-62]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-62]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=6|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=6|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=6|ip-26-0-171-62]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=10|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=10|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=10|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=10|ip-26-0-161-78]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=15|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=10|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=9|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=9|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=9|ip-26-0-171-102]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=15|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=10|ip-26-0-171-88]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=10|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=9|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=10|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=15|ip-26-0-171-102]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=9|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=14|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=10|ip-26-0-171-102]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=14|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=9|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=15|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=15|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=14|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=15|ip-26-0-171-88]: No checkpoint path provided. [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=8|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=8|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=8|ip-26-0-161-138]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=9|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=9|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=9|ip-26-0-161-138]: No checkpoint path provided. [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=11|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=14|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=14|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=12|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=11|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=11|ip-26-0-171-102]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=14|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=6|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=6|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=6|ip-26-0-166-125]: No checkpoint path provided. [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=0|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=14|ip-26-0-161-138]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=14|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=13|ip-26-0-161-78]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=13|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=12|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=0|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=0|ip-26-0-166-125]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-103]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=14|ip-26-0-161-78]: No checkpoint path provided. [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=13|ip-26-0-161-78]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=12|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=12|ip-26-0-161-138]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=12|ip-26-0-161-78]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=12|ip-26-0-161-78]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=1|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=1|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=11|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=11|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=4|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=4|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=4|ip-26-0-166-125]: No checkpoint path provided. [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=11|ip-26-0-161-138]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=1|ip-26-0-166-125]: No checkpoint path provided. [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=13|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=13|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=13|ip-26-0-161-138]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=15|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=15|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=15|ip-26-0-161-138]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=7|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=7|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=7|ip-26-0-166-125]: No checkpoint path provided. [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-153]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=14|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=14|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=14|ip-26-0-171-102]: No checkpoint path provided. [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=13|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=13|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=5|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=13|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=13|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=13|ip-26-0-171-102]: No checkpoint path provided. [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=13|ip-26-0-171-88]: No checkpoint path provided. [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=5|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=5|ip-26-0-166-125]: No checkpoint path provided. [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=3|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=3|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=3|ip-26-0-166-125]: No checkpoint path provided. [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=8|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=11|ip-26-0-171-88]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=8|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=8|ip-26-0-171-102]: No checkpoint path provided. [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=11|ip-26-0-171-88]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=11|ip-26-0-171-88]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=10|ip-26-0-161-138]: Local number of parameters: 24.8M (47.33MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=10|ip-26-0-161-138]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=10|ip-26-0-161-138]: No checkpoint path provided. [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=5|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=5|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=5|ip-26-0-171-62]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=12|ip-26-0-171-102]: Local number of parameters: 15.8M (30.05MiB) [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=12|ip-26-0-171-102]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=12|ip-26-0-171-102]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=4|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Total number of parameters: 1.21G (2315.81MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=4|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=4|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Parametrizing model parameters using StandardParametrizator [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=5|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=3|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=3|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=3|ip-26-0-171-62]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=5|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=5|ip-26-0-161-153]: No checkpoint path provided. [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-153]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-153]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=7|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=7|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=4|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=4|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=7|ip-26-0-161-103]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=4|ip-26-0-171-62]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-103]: No checkpoint path provided. [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=4|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=4|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default4]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=4|ip-26-0-161-153]: No checkpoint path provided. [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=5|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=5|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=5|ip-26-0-161-103]: No checkpoint path provided. [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default0]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-62]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=7|ip-26-0-171-62]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=7|ip-26-0-171-62]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=3|TP=7|ip-26-0-171-62]: No checkpoint path provided. [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=2|ip-26-0-166-125]: Local number of parameters: 15.8M (30.05MiB) [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=2|ip-26-0-166-125]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default2]:07/03/2024 03:18:24 [INFO|DP=0|PP=2|TP=2|ip-26-0-166-125]: No checkpoint path provided. [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=7|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=7|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default7]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=7|ip-26-0-161-153]: No checkpoint path provided. [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default3]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-153]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=6|ip-26-0-161-103]: Local number of parameters: 24.8M (47.33MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=6|ip-26-0-161-103]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=0|TP=6|ip-26-0-161-103]: No checkpoint path provided. [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=6|ip-26-0-161-153]: Local number of parameters: 18.4M (35.05MiB) [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=6|ip-26-0-161-153]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default6]:07/03/2024 03:18:24 [INFO|DP=0|PP=1|TP=6|ip-26-0-161-153]: No checkpoint path provided. [default0]:07/03/2024 03:18:26 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/03/2024 03:18:26 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/03/2024 03:18:26 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [ZeRO sharding] DP Rank 0 has 24.8M out of 24.8M (100.00%) params' optimizer states [default0]:07/03/2024 03:18:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/03/2024 03:18:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Using `datasets` library [default0]:07/03/2024 03:18:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:28 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [Training Plan] There are 1 training stages [default0]:07/03/2024 03:18:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [Stage Training Stage] start from step 1 [default0]:07/03/2024 03:18:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [default0]:07/03/2024 03:18:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: [Start training] datetime: 2024-07-03 03:18:29.263250 | mbs: 256 | grad_accum: 4 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/03/2024 03:18:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/03/2024 03:18:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Memory usage: 244.38MiB. Peak allocated 244.38MiB. Peak reserved: 266.00MiB [default3]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=11|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=12|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=15|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=10|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=10|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=14|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=9|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=15|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=12|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=4|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=13|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=5|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=3|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=10|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=5|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=4|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=1|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=3|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=7|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=6|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=6|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=8|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=15|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=6|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=15|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=9|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=8|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=14|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=0|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=7|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=1|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=12|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=1|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=13|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=0|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=12|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=11|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=2|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=3|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=7|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=2|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=5|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=2|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=3|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=9|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=11|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=13|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=14|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=14|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=4|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=7|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=9|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=6|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=8|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=5|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 03:18:29 [WARNING|DP=0|PP=1|TP=4|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=2|TP=10|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 03:18:29 [WARNING|DP=0|PP=0|TP=13|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 03:18:29 [WARNING|DP=0|PP=3|TP=2|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 03:18:30 [WARNING|DP=0|PP=2|TP=11|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:[rank0]: Traceback (most recent call last): [default7]:[rank7]: Traceback (most recent call last): [default4]:[rank12]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank12]: trainer.train(dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: output = model(**micro_batch) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: sharded_logits = self.model( [default4]:[rank12]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: sharded_logits = self.model( [default7]:[rank7]: return forward_call(*args, **kwargs) [default0]:[rank8]: output = model(**micro_batch) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: return forward_call(*args, **kwargs) [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return forward_call(*args, **kwargs) [default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank7]: output = self.pp_block(**new_kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank8]: output = self.pp_block(**new_kwargs) [default7]:[rank7]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: output = self.o_proj(attention_output) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank12]: output = self.pp_block(**new_kwargs) [default7]:[rank7]: return row_linear( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: out = F.linear(input, weight, bias) [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.14 GiB is free. Including non-PyTorch memory, this process has 78.18 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default4]:[rank12]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank0]: output = self.o_proj(attention_output) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: return row_linear( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default0]:[rank0]: out = F.linear(input, weight, bias) [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank12]: output = self.o_proj(attention_output) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: output = self.o_proj(attention_output) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank12]: return forward_call(*args, **kwargs) [default0]:[rank8]: return row_linear( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank12]: return row_linear( [default0]:[rank8]: out = F.linear(input, weight, bias) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank12]: out = F.linear(input, weight, bias) [default4]:[rank12]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 567.94 MiB is free. Including non-PyTorch memory, this process has 78.76 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank8]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU [default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank15]: sharded_logits = self.model( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank15]: output = self.pp_block(**new_kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default7]:[rank15]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default7]:[rank15]: output = self.o_proj(attention_output) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default7]:[rank15]: return row_linear( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default7]:[rank15]: out = F.linear(input, weight, bias) [default7]:[rank15]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 655.94 MiB is free. Including non-PyTorch memory, this process has 78.68 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/na[default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( notron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: output = model(**micro_batch) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default3]:[rank3]: output = model(**micro_batch) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/pytho[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: sharded_logits = self.model( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl n3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [defaul[default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [defaultt3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank3]: output = self.pp_block(**new_kwargs) 6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank14]: output = self.pp_block(**new_kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default3]:[rank3]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/ forward [default6]:[rank14]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank3]: output = self.o_proj(attention_output) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank3]: return row_linear( [default3]:[rank3]: File "/fsx/ferdinandmom/f[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank14]: output = self.o_proj(attention_output) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) erdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default3]:[rank3]: out = F.linear(input, weight, bias) [default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.14 GiB is free. Including non-PyTorch memory, this process has 78.18 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank14]: return row_linear( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank14]: out = F.linear(input, weight, bias) [default6]:[rank14]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 567.94 MiB is free. Including non-PyTorch memory, this process has 78.76 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: trainer.train(dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: output = model(**micro_batch) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank11]: sharded_logits = self.model( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank11]: output = self.pp_block(**new_kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default3]:[rank11]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank11]: output = self.o_proj(attention_output) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank11]: return row_linear( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default3]:[rank11]: out = F.linear(input, weight, bias) [default3]:[rank11]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 655.94 MiB is free. Including non-PyTorch memory, this process has 78.68 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank10]: output = model(**micro_batch) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: sharded_logits = self.model( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank10]: output = self.pp_block(**new_kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank10]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank10]: output = self.o_proj(attention_output) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank10]: return row_linear( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank10]: out = F.linear(input, weight, bias) [default2]:[rank10]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 567.94 MiB is free. Including non-PyTorch memory, this process has 78.76 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default2]:[rank2]: output = model(**micro_batch) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: sharded_logits = self.model( [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default4]:[rank4]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default4]:[rank4]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank4]: output = self.o_proj(attention_output) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank2]: output = self.o_proj(attention_output) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank4]: return row_linear( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank4]: out = F.linear(input, weight, bias) [default2]:[rank2]: return row_linear( [default4]:[rank4]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.23 GiB is free. Including non-PyTorch memory, this process has 78.09 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank2]: out = F.linear(input, weight, bias) [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.23 GiB is free. Including non-PyTorch memory, this process has 78.09 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank1]: output = self.o_proj(attention_output) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank1]: return row_linear( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank1]: out = F.linear(input, weight, bias) [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.14 GiB is free. Including non-PyTorch memory, this process has 78.18 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank9]: output = self.pp_block(**new_kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank9]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank9]: output = self.o_proj(attention_output) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank9]: return row_linear( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank9]: out = F.linear(input, weight, bias) [default1]:[rank9]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 655.94 MiB is free. Including non-PyTorch memory, this process has 78.68 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank13]: Traceback (most recent call last): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: output = model(**micro_batch) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank13]: sharded_logits = self.model( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank13]: output = self.pp_block(**new_kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default5]:[rank13]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default5]:[rank13]: output = self.o_proj(attention_output) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank13]: return row_linear( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default5]:[rank13]: out = F.linear(input, weight, bias) [default5]:[rank13]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 655.94 MiB is free. Including non-PyTorch memory, this process has 78.68 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: sharded_logits = self.model( [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank6]: output = self.pp_block(**new_kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default6]:[rank6]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank6]: output = self.o_proj(attention_output) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: return row_linear( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank6]: out = F.linear(input, weight, bias) [default5]:[rank5]: output = self.pp_block(**new_kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default5]:[rank5]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.23 GiB is free. Including non-PyTorch memory, this process has 78.09 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default5]:[rank5]: output = self.o_proj(attention_output) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank5]: return row_linear( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default5]:[rank5]: out = F.linear(input, weight, bias) [default5]:[rank5]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.14 GiB is free. Including non-PyTorch memory, this process has 78.18 GiB memory in use. Of the allocated memory 68.59 GiB is allocated by PyTorch, and 457.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank25]: Traceback (most recent call last): [default2]:[rank26]: Traceback (most recent call last): [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank25]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank25]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank25]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank25]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank25]: output = model(**micro_batch) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: return forward_call(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank25]: sharded_logits = self.model( [default3]:[rank27]: Traceback (most recent call last): [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank24]: Traceback (most recent call last): [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank27]: trainer.train(dataloader) [default2]:[rank26]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank24]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank28]: Traceback (most recent call last): [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank28]: trainer.train(dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank27]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank26]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank24]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank27]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank24]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank24]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank25]: return forward_call(*args, **kwargs) [default2]:[rank26]: output = model(**micro_batch) [default3]:[rank59]: Traceback (most recent call last): [default7]:[rank55]: Traceback (most recent call last): [default0]:[rank48]: Traceback (most recent call last): [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank59]: trainer.train(dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank59]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank59]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank59]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: output = model(**micro_batch) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank55]: trainer.train(dataloader) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank55]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank55]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank48]: trainer.train(dataloader) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank48]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank59]: sharded_logits = self.model( [default3]:[rank27]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank48]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank60]: Traceback (most recent call last): [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: trainer.train(dataloader) [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank25]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank48]: output = model(**micro_batch) [default7]:[rank55]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank59]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank26]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank28]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank48]: return forward_call(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank48]: sharded_logits = self.model( [default4]:[rank60]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank27]: output = model(**micro_batch) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank28]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: output = model(**micro_batch) [default4]:[rank60]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank24]: output = model(**micro_batch) [default1]:[rank25]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank48]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank26]: sharded_logits = self.model( [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: output = model(**micro_batch) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank28]: output = model(**micro_batch) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank59]: pipeline_state.run_communication() [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: recv_activation_tensor = recv_activation() [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank60]: sharded_logits = self.model( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: return forward_call(*args, **kwargs) [default3]:[rank59]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank27]: return forward_call(*args, **kwargs) [default2]:[rank26]: return forward_call(*args, **kwargs) [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank60]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: dist.recv( [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank25]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank59]: return func(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank26]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank59]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank28]: return forward_call(*args, **kwargs) [default3]:[rank59]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank27]: sharded_logits = self.model( [default4]:[rank60]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank24]: sharded_logits = self.model( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank59]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f15c8371897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank59]: frame #1: + 0x5b3a23e (0x7f1601e8e23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default2]:[rank26]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank59]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1601e88c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1601e88f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank60]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank60]: pipeline_state.run_communication() [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank25]: pipeline_state.run_communication() [default4]:[rank60]: recv_activation_tensor = recv_activation() [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: sharded_logits = self.model( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1601e89fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1601e3e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1601e3e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1601e3e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1601e3e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank24]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank59]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f15c964b189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank59]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f15c9652610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f15c9671978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank27]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank60]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: frame #12: + 0x5adc309 (0x7f1601e30309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #13: + 0x5ae6f10 (0x7f1601e3af10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank60]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank24]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank27]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank59]: frame #14: + 0x5ae6fa5 (0x7f1601e3afa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #15: + 0x5124446 (0x7f1601478446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: dist.recv( [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank60]: return func(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank60]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank27]: return forward_call(*args, **kwargs) [default2]:[rank26]: pipeline_state.run_communication() [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank59]: frame #16: + 0x1acf4b8 (0x7f15fde234b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: frame #17: + 0x5aee004 (0x7f1601e42004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: recv_activation_tensor = recv_activation() [default3]:[rank59]: frame #18: + 0x5af36b5 (0x7f1601e476b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: frame #19: + 0xd2631e (0x7f1614a3131e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank59]: frame #20: + 0x47def4 (0x7f1614188ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: frame #21: + 0x1445a6 (0x555648c535a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555648c4ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: recv_activation_tensor = recv_activation() [default0]:[rank24]: return forward_call(*args, **kwargs) [default3]:[rank59]: frame #23: + 0x150866 (0x555648c5f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555648c48142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank28]: return forward_call(*args, **kwargs) [default3]:[rank59]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555648c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: frame #26: PyObject_Call + 0xbc (0x555648c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank60]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default4]:[rank60]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank60]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0038faa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank59]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555648c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555648c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank59]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555648c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank59]: frame #30: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #1: + 0x5b3a23e (0x7f0072ac723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: pipeline_state.run_communication() [default3]:[rank59]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555648c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: frame #32: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank59]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555648c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0072ac1c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank60]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0072ac1f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank59]: frame #34: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555648c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555648c4bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank59]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555648c5dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #38: + 0x211239 (0x555648d20239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0072ac2fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0072a77371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0072a77371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank60]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0072a77371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank28]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0072a77371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555648c4ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank25]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555648c483e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: recv_activation_tensor = recv_activation() [default3]:[rank59]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555648c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank59]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555648c43c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555648c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank59]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555648c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f003a284189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank48]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f003a28b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f003a2aa978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: frame #12: + 0x5adc309 (0x7f0072a69309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #13: + 0x5ae6f10 (0x7f0072a73f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #45: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl bench-cluster/bin/python3.10) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank25]: dist.recv( [default3]:[rank59]: frame #46: PyObject_Call + 0xbc (0x555648c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555648c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: frame #48: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #49: PyObject_Call + 0xbc (0x555648c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555648c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555648c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555648c4c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555648c5dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank55]: return forward_call(*args, **kwargs) [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: return forward_call(*args, **kwargs) [default4]:[rank60]: frame #14: + 0x5ae6fa5 (0x7f0072a73fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank60]: frame #15: + 0x5124446 (0x7f00720b1446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: dist.recv( [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank59]: frame #54: + 0x211239 (0x555648d20239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: return func(*args, **kwargs) [default0]:[rank24]: pipeline_state.run_communication() [default7]:[rank55]: sharded_logits = self.model( [default4]:[rank60]: frame #16: + 0x1acf4b8 (0x7f006ea5c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #17: + 0x5aee004 (0x7f0072a7b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #18: + 0x5af36b5 (0x7f0072a806b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank60]: frame #19: + 0xd2631e (0x7f008566a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank60]: frame #20: + 0x47def4 (0x7f0084dc1ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank28]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank24]: recv_activation_tensor = recv_activation() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank48]: pipeline_state.run_communication() [default4]:[rank60]: frame #21: + 0x1445a6 (0x55c5fc92a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c5fc923a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #23: + 0x150866 (0x55c5fc936866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: return forward_call(*args, **kwargs) [default3]:[rank59]: frame #55: PyObject_Call + 0x207 (0x555648c60067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555648c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank55]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank59]: frame #57: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555648c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank59]: frame #59: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #60: PyObject_Call + 0xbc (0x555648c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank55]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank59]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555648c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #62: + 0x150582 (0x555648c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: return func(*args, **kwargs) [default1]:[rank25]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank48]: recv_activation_tensor = recv_activation() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank59]: frame #63: PyObject_Call + 0xbc (0x555648c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c5fc91f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank48]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank59]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank24]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: return forward_call(*args, **kwargs) [default4]:[rank60]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c5fc92aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: pipeline_state.run_communication() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank60]: frame #26: PyObject_Call + 0xbc (0x55c5fc936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5fc91d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank27]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank48]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank39]: Traceback (most recent call last): [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank39]: trainer.train(dataloader) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank39]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank39]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank39]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default4]:[rank60]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c5fc92aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c5fc91b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #30: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c5fc91b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank48]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank55]: new_kwargs[name] = recv_from_pipeline_state_buffer( ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank39]: output = model(**micro_batch) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: return forward_call(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank39]: sharded_logits = self.model( [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: Traceback (most recent call last): [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank43]: trainer.train(dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank43]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank43]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank43]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default4]:[rank60]: frame #32: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c5fc91b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #34: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank55]: pipeline_state.run_communication() [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: return forward_call(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank39]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank39]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: return self._call_impl(*args, **kwargs) [defaultron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank43]: output = model(**micro_batch) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: return forward_call(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank60]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c5fc91b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c5fc922f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c5fc934c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #38: + 0x211239 (0x55c5fc9f7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank48]: dist.recv( [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper 7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: return forward_call(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank39]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank39]: pipeline_state.run_communication() [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank39]: recv_activation_tensor = recv_activation() [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/p[default3]:[rank43]: sharded_logits = self.model( [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: return forward_call(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank43]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank43]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank43][default4]:[rank60]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c5fc923a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c5fc91f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c5fc92aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c5fc91ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: recv_activation_tensor = recv_activation() [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication arallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank39]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors : File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: return forward_call(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank43]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank43]: pipeline_state.run_communication() [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel[default4]:[rank60]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c5fc92aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c5fc91b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #45: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank48]: return func(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank39]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank39]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank39]: dist.recv( [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank39]: return func(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank39]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank39/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank43]: recv_activation_tensor = recv_activation() [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank43]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank43]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank43]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py[default4]:[rank60]: frame #46: PyObject_Call + 0xbc (0x55c5fc936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5fc91d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #48: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #49: PyObject_Call + 0xbc (0x55c5fc936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5fc91d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c5fc92aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c5fc923007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/[default1]:[rank25]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank55]: recv_activation_tensor = recv_activation() [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ ]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default7]:[rank39]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank39]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdc40f8a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank39]: frame #1: + 0x5b3a23e (0x7fdc7aaa723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fdc7aaa1c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39", line 246, in _recv_meta [default3]:[rank43]: dist.recv( [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank43]: return func(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank43]: pg.recv([tensor], group_src_rank, tag).wait() bin/python3.10) [default4]:[rank60]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c5fc934c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #54: + 0x211239 (0x55c5fc9f7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank48]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank55]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] ]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fdc7aaa1f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fdc7aaa2fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdc7aa57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdc7aa57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdc7aa57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [defau[default3]:[rank43]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default3]:[rank43]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank43]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc05c9bb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank43]: frame #1: + 0x5b3a23e (0x7fc0964d823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fc0964d2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #55: PyObject_Call + 0x207 (0x55c5fc937067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5fc91d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #57: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c5fc91b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #59: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #60: PyObject_Call + 0xbc (0x55c5fc936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5fc91d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bi[default2]:[rank26]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank48]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default0]:[rank48]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): lt7]:[rank39]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdc7aa57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fdc42264189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc0964d2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc0964d3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc096488371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc096488371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc096488371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libto[default3]:[rank27]: dist.recv( [default1]:[rank25]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa0d62c1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank24]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank48]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7c4196d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank48]: frame #1: + 0x5b3a23e (0x7f7c7b48a23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fdc4226b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fdc4228a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: frame #12: + 0x5adc309 (0x7fdc7aa49309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #13: + 0x5ae6f10 (0x7fdc7aa53f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) rch_cpu.so) [default3]:[rank43]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc096488371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc05dc95189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fc05dc9c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fc05dcbb978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: [default2]:[rank26]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank55]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank39]: frame #14: + 0x5ae6fa5 (0x7fdc7aa53fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #15: + 0x5124446 (0x7fdc7a091446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) frame #12: + 0x5adc309 (0x7fc09647a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #13: + 0x5ae6f10 (0x7fc096484f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank48]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f7c7b484c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7c7b484f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank39]: frame #16: + 0x1acf4b8 (0x7fdc76a3c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #17: + 0x5aee004 (0x7fdc7aa5b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #14: + 0x5ae6fa5 (0x7fc096484fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #1: + 0x5b3a23e (0x7fa10fdde23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank39]: frame #18: + 0x5af36b5 (0x7fdc7aa606b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #19: + 0xd2631e (0x7fdc8d64a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank39]: frame #20: + 0x47def4 (0x7fdc8cda1ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank43]: frame #15: + 0x5124446 (0x7fc095ac2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #16: + 0x1acf4b8 (0x7fc09246d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #17: + 0x5aee004 (0x7fc09648c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #18: + 0x5af36b5 (0x7fc0964916b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #19: + 0xd2631e (0x7fc0a907b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) n/python3.10) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank55]: dist.recv( [default7]:[rank39]: frame #21: + 0x1445a6 (0x560e75b1b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560e75b14a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #23: + 0x150866 (0x560e75b27866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #20: + 0x47def4 (0x7fc0a87d2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank43]: frame #21: + 0x1445a6 (0x55e94db8b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #62: + 0x150582 (0x55c5fc936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #63: PyObject_Call + 0xbc (0x55c5fc936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank27]: return func(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank39]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560e75b10142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560e75b1ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55e94db84a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #23: + 0x150866 (0x55e94db97866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e94db80142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa10fdd8c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: return func(*args, **kwargs) [default7]:[rank39]: frame #26: PyObject_Call + 0xbc (0x560e75b27f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560e75b0e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560e75b1ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55e94db8ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #26: PyObject_Call + 0xbc (0x55e94db97f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank39]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560e75b0c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #30: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560e75b0c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e94db7e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55e94db8ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e94db7c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #30: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e94db7c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: dist.recv( [default2]:[rank26]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3c0e80e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank55]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank39]: frame #32: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560e75b0c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #34: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560e75b0c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560e75b13f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560e75b25c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #38: + 0x211239 (0x560e75be8239 in /fsx/ferdinandmom/miniforge3/envs/e[default3]:[rank43]: frame #32: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e94db7c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank48]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7c7b485fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) nv-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560e75b14a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560e75b103e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #34: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e94db7c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e94db83f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank55]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default7]:[rank55]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank39]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560e75b1ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560e75b0bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560e75b1ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55e94db95c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #38: + 0x211239 (0x55e94dc58239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55e94db84a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e94db803e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #1: + 0x5b3a23e (0x7f3c4832b23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank39]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560e75b0c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #45: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #46: PyObject_Call + 0xbc (0x560e75b27f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560e75b0e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55e94db8ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e94db7bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55e94db8ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank39]: frame #48: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #49: PyObject_Call + 0xbc (0x560e75b27f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560e75b0e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e94db7c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #45: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #46: PyObject_Call + 0xbc (0x55e94db97f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e94db7e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #48: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #49: PyObject_Call + 0xbc (0x55e94db97f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e94db7e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin[default2]:[rank26]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f3c48325c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3c48325f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560e75b1ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560e75b14007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) /python3.10) [default3]:[rank43]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55e94db8ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank39]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560e75b25c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #54: + 0x211239 (0x560e75be8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #55: PyObject_Call + 0x207 (0x560e75b28067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560e75b0e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #57: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560e75b0c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #59: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-clust[default3]:[rank43]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e94db84007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55e94db95c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #54: + 0x211239 (0x55e94dc58239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #55: PyObject_Call + 0x207 (0x55e94db98067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e94db7e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank48]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7b43a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7b43a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) er/bin/python3.10) [default3]:[rank43]: frame #57: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e94db7c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #59: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: return func(*args, **kwargs) [default2]:[rank26]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f3c48326fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7b43a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8c6f5ff897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank39]: frame #60: PyObject_Call + 0xbc (0x560e75b27f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560e75b0e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #62: + 0x150582 (0x560e75b27582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #63: PyObject_Call + 0xbc (0x560e75b27f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #60: PyObject_Call + 0xbc (0x55e94db97f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e94db7e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #62: + 0x150582 (0x55e94db97582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa10fdd8f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #1: + 0x5b3a23e (0x7f8ca911c23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank43]: frame #63: PyObject_Call + 0xbc (0x55e94db97f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank55]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f8ca9116c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: Traceback (most recent call last): [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank46]: trainer.train(dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank46]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank46]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank46]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default3]:[rank27]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank28]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank25]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa10fdd9fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa10fd8e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8ca9116f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank46]: output = model(**micro_batch) [default0]:[rank24]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank55]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8ca9117fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank46]: return forward_call(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank46]: sharded_logits = self.model( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.p[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank55]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ca90cc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) y", line 1541, in _call_impl [default6]:[rank46]: return forward_call(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank27]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f132368f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank26]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3c482db371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7b43a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank46]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank46]: return forward_call(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank46]: new_kwargs[name] = recv_from[default1]:[rank25]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa10fd8e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7c42c47189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) _pipeline_state_buffer( [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank46]: pipeline_state.run_communication() [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank46]: recv_activation_tensor = recv_activation() [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank46]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank46]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag[default3]:[rank27]: frame #1: + 0x5b3a23e (0x7f135d1ac23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7c42c4e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f7c42c6d978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) =tag) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank46]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank46]: dist.recv( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank46]: return func(*args, **kwargs) [default4]:[rank28]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank48]: frame #12: + 0x5adc309 (0x7f7c7b42c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank46]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank46]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default6]:[rank46]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank25]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa10fd8e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3c482db371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ca90cc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ca90cc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ca90cc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f967ea5c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank46]: frame #1: + 0x5b3a23e (0x7f96b857923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f135d1a6c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank48]: frame #13: + 0x5ae6f10 (0x7f7c7b436f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f96b8573c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank55]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8c708d9189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f96b8573f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f96b8574fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f96b8529371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f96b8529371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f96b8529371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libto[default1]:[rank25]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa10fd8e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8c708e0610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank55]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f8c708ff978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) rch_cpu.so) [default0]:[rank24]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank48]: frame #14: + 0x5ae6fa5 (0x7f7c7b436fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #15: + 0x5124446 (0x7f7c7aa74446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f96b8529371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: dist.recv( [default7]:[rank55]: frame #12: + 0x5adc309 (0x7f8ca90be309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f967fd36189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f967fd3d610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f967fd5c978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #12: + 0x5adc309 (0x7f96b851b309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f135d1a6f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #13: + 0x5ae6f10 (0x7f8ca90c8f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #13: + 0x5ae6f10 (0x7f96b8525f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3c482db371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3c482db371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #16: + 0x1acf4b8 (0x7f7c7741f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #14: + 0x5ae6fa5 (0x7f96b8525fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #15: + 0x5124446 (0x7f96b7b63446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #16: + 0x1acf4b8 (0x7f96b450e4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa0d759b189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #17: + 0x5aee004 (0x7f7c7b43e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #17: + 0x5aee004 (0x7f96b852d004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #18: + 0x5af36b5 (0x7f96b85326b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank55]: frame #14: + 0x5ae6fa5 (0x7f8ca90c8fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #15: + 0x5124446 (0x7f8ca8706446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #19: + 0xd2631e (0x7f96cb11c31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #20: + 0x47def4 (0x7f96ca873ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #21: + 0x1445a6 (0x563a1f34a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #22: _PyObject_MakeTpCall + 0x26b (0x563a1f343a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #23: + 0x150866 (0x563a1f356866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f343d838897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank26]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f3c0fae8189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank27]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f135d1a7fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #16: + 0x1acf4b8 (0x7f8ca50b14b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563a1f33f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #25: _PyFunction_Vectorcall + 0x6c (0x563a1f34aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #26: PyObject_Call + 0xbc (0x563a1f356f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: return func(*args, **kwargs) [default0]:[rank48]: frame #18: + 0x5af36b5 (0x7f7c7b4436b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #19: + 0xd2631e (0x7f7c8e02d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank38]: Traceback (most recent call last): [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank38]: trainer.train(dataloader) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank38]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank38]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank38]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default6]:[rank46]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563a1f33d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #28: _PyFunction_Vectorcall + 0x6c (0x563a1f34aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563a1f33b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #30: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa0d75a2610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #20: + 0x47def4 (0x7f7c8d784ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563a1f33b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #32: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563a1f33b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #1: + 0x5b3a23e (0x7f347735523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #17: + 0x5aee004 (0x7f8ca90d0004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #34: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563a1f33b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563a1f342f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #37: _PyObject_Call_Prepend + 0x69 (0x563a1f354c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #38: + 0x211239 (0x563a1f417239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f3c0faef610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank55]: frame #18: + 0x5af36b5 (0x7f8ca90d56b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #19: + 0xd2631e (0x7f8cbbcbf31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank55]: frame #20: + 0x47def4 (0x7f8cbb416ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank38]: output = model(**micro_batch) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: return forward_call(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank38]: sharded_logits = self.model( [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: frame #39: _PyObject_MakeTpCall + 0x26b (0x563a1f343a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563a1f33f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #41: _PyFunction_Vectorcall + 0x6c (0x563a1f34aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa0d75c1978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #21: + 0x1445a6 (0x55ca21d765a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55ca21d6fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: return forward_call(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank38]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank38]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default[default6]:[rank46]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563a1f33ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #43: _PyFunction_Vectorcall + 0x6c (0x563a1f34aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563a1f33b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank24]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f347734fc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #23: + 0x150866 (0x55ca21d82866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55ca21d6b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: return forward_call(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank38]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank38]: pipeline_state.run_communication() [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank38]: recv_activation_tensor = recv_activation() [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/p[default6]:[rank46]: frame #45: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #46: PyObject_Call + 0xbc (0x563a1f356f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563a1f33d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #48: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #49: PyObject_Call + 0xbc (0x563a1f356f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563a1f33d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #51: _PyFunction_Vectorcall + 0x6c (0x563a1f34aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/pyt[default2]:[rank26]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f3c0fb0e978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55ca21d76a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #21: + 0x1445a6 (0x55a1511425a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) arallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank38]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors hon3.10) [default6]:[rank46]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563a1f343007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #53: _PyObject_Call_Prepend + 0x69 (0x563a1f354c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #54: + 0x211239 (0x563a1f417239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #55: PyObject_Call + 0x207 (0x563a1f357067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563a1f33d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #57: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: Traceback (most recent call last): [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: trainer.train(dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank57]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank57]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank57]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default3]:[rank27]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f135d15c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #26: PyObject_Call + 0xbc (0x55ca21d82f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank38]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank38]: dist.recv( [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank38]: return func(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank38]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank38[default6]:[rank46]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563a1f33b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #59: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #60: PyObject_Call + 0xbc (0x563a1f356f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563a1f33d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #62: + 0x150582 (0x563a1f356582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f347734ff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a15113ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #23: + 0x150866 (0x55a15114e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default6]:[rank38]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank38]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2264a9d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank38]: frame #1: + 0x5b3a23e (0x7f229e5ba23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f229e5b4c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38[default6]:[rank46]: frame #63: PyObject_Call + 0xbc (0x563a1f356f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank25]: frame #12: + 0x5adc309 (0x7fa10fd80309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank55]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a151137142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f229e5b4f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f229e5b5fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f229e56a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: frame #12: + 0x5adc309 (0x7f3c482cd309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a151142a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55ca21d692b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f229e56a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f229e56a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f229e56a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2265d77189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank38]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f22[default2]:[rank26]: frame #13: + 0x5ae6f10 (0x7f3c482d7f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55ca21d76a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55ca21d678fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 65d7e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank38]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f2265d9d978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank24]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f3477350fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #30: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55ca21d678fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #12: + 0x5adc309 (0x7f229e55c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #13: + 0x5ae6f10 (0x7f229e566f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank57]: output = model(**micro_batch) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank28]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank48]: frame #32: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55ca21d678fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #14: + 0x5ae6fa5 (0x7f229e566fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank57]: sharded_logits = self.model( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotro[default0]:[rank24]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3477305371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #34: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55ca21d678fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #15: + 0x5124446 (0x7f229dba4446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #16: + 0x1acf4b8 (0x7f229a54f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) n/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank57]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank57]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward[default3]:[rank27]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f135d15c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f135d15c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5b3a915897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank48]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55ca21d6ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #17: + 0x5aee004 (0x7f229e56e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #18: + 0x5af36b5 (0x7f229e5736b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #19: + 0xd2631e (0x7f22b115d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank25]: frame #13: + 0x5ae6f10 (0x7fa10fd8af10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55ca21d80c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #38: + 0x211239 (0x55ca21e43239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55ca21d6fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #20: + 0x47def4 (0x7f22b08b4ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank38]: frame #21: + 0x1445a6 (0x5595e45cb5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5595e45c4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank57]: pipeline_state.run_communication() [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank57]: recv_activation_tensor = recv_activation() [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank57]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank57]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank57[default3]:[rank27]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f135d15c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #26: PyObject_Call + 0xbc (0x55a15114ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #23: + 0x150866 (0x5595e45d7866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5595e45c0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5595e45cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank57]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank57]: dist.recv( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank57]: return func(*args, **kwargs) [default4]:[rank28]: frame #1: + 0x5b3a23e (0x7f5b7443223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1511352b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a151142a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a1511338fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #26: PyObject_Call + 0xbc (0x5595e45d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5595e45be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank57]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank57]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default2]:[rank26]: frame #14: + 0x5ae6fa5 (0x7f3c482d7fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #30: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55ca21d6b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55ca21d76a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5595e45cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5595e45bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #30: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5595e45bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #32: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5595e45bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #34: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env[default1]:[rank57]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank57]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0d1660a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank57]: frame #1: + 0x5b3a23e (0x7f0d5012723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1324969189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank27]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f1324970610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank55]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a1511338fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #32: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a1511338fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #34: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a1511338fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a15113af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) -bench-cluster/bin/python3.10) [default6]:[rank38]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5595e45bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5595e45c3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5595e45d5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #38: + 0x211239 (0x5595e4698239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5595e45c4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0d50121c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0d50121f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f132498f978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank26]: frame #15: + 0x5124446 (0x7f3c47915446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #14: + 0x5ae6fa5 (0x7fa10fd8afa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a15114cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55ca21d66c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55ca21d76a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55ca21d678fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5595e45c03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5595e45cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5595e45bbc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5595e45cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5595e45bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #45: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0d50122fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0d500d7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0d500d7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #15: + 0x5124446 (0x7fa10f3c8446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #38: + 0x211239 (0x55a15120f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a15113ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #46: PyObject_Call + 0xbc (0x5595e45d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5595e45be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #48: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0d500d7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0d500d7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0d178e4189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank28]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f5b7442cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #45: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #49: PyObject_Call + 0xbc (0x5595e45d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5595e45be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5595e45cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5595e45c4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #12: + 0x5adc309 (0x7f135d14e309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a1511373e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5595e45d5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #54: + 0x211239 (0x5595e4698239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #55: PyObject_Call + 0x207 (0x5595e45d8067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5595e45be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #16: + 0x1acf4b8 (0x7fa10bd734b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #46: PyObject_Call + 0xbc (0x55ca21d82f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55ca21d692b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #57: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5595e45bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #59: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5b7442cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a151142a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #60: PyObject_Call + 0xbc (0x5595e45d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5595e45be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #13: + 0x5ae6f10 (0x7f135d158f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #48: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #62: + 0x150582 (0x5595e45d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #63: PyObject_Call + 0xbc (0x5595e45d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank57]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0d178eb610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank57]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0d1790a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank57]: frame #12: + 0x5adc309 (0x7f0d500c9309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #13: + 0x5ae6f10 (0x7f0d500d3f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank26]: frame #16: + 0x1acf4b8 (0x7f3c442c04b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a151132c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: Traceback (most recent call last): [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank33]: trainer.train(dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank33]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank33]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank33]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default1]:[rank57]: frame #14: + 0x5ae6fa5 (0x7f0d500d3fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #15: + 0x5124446 (0x7f0d4f711446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #16: + 0x1acf4b8 (0x7f0d4c0bc4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #17: + 0x5aee004 (0x7f0d500db004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #18: + 0x5af36b5 (0x7f0d500e06b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5b7442dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a151142a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank33]: output = model(**micro_batch) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank33]: sharded_logits = self.model( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: frame #19: + 0xd2631e (0x7f0d62cca31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #20: + 0x47def4 (0x7f0d62421ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #21: + 0x1445a6 (0x561a98b4e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #17: + 0x5aee004 (0x7fa10fd92004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3477305371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3477305371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a1511338fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #45: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank33]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank33]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-[default1]:[rank57]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561a98b47a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #23: + 0x150866 (0x561a98b5a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561a98b43142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561a98b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b743e2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #46: PyObject_Call + 0xbc (0x55a15114ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank33]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank33]: pipeline_state.run_communication() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank33]: recv_activation_tensor = recv_activation() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [defaul[default1]:[rank57]: frame #26: PyObject_Call + 0xbc (0x561a98b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561a98b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561a98b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #17: + 0x5aee004 (0x7f3c482df004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1511352b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #49: PyObject_Call + 0xbc (0x55ca21d82f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) t1]:[rank33]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank33]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank33]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank33]: dist.recv( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank33]: return func(*args, **kwargs) [default1]:[rank57]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561a98b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #30: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561a98b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #18: + 0x5af36b5 (0x7f3c482e46b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #48: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank33]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank33]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default1]:[rank33]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank33]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe5c2de3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank33]: frame #1: + 0x5b3a23e (0x7fe5fc90023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #2:[default1]:[rank57]: frame #32: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561a98b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #34: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561a98b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #18: + 0x5af36b5 (0x7fa10fd976b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #49: PyObject_Call + 0xbc (0x55a15114ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fe5fc8fac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe5fc8faf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe5fc8fbfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe5fc8b0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe5fc8b0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.[default1]:[rank57]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561a98b46f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561a98b58c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #38: + 0x211239 (0x561a98c1b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b743e2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55ca21d692b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe5fc8b0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe5fc8b0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561a98b47a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561a98b433e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561a98b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561a98b3ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3477305371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #14: + 0x5ae6fa5 (0x7f135d158fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1511352b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a151142a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a15113b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe5c40bd189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fe5c40c4610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fe5c40e3978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #12: + 0x5adc309 (0x7fe5fc8a2309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #13: [default1]:[rank57]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561a98b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561a98b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #45: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #46: PyObject_Call + 0xbc (0x561a98b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #19: + 0xd2631e (0x7fa12298131e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank48]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55ca21d76a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) + 0x5ae6f10 (0x7fe5fc8acf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561a98b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #48: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #49: PyObject_Call + 0xbc (0x561a98b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f343eb12189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55ca21d6f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #14: + 0x5ae6fa5 (0x7fe5fc8acfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #15: + 0x5124446 (0x7fe5fbeea446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561a98b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561a98b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561a98b47007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561a98b58c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #15: + 0x5124446 (0x7f135c796446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55ca21d80c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #16: + 0x1acf4b8 (0x7fe5f88954b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #17: + 0x5aee004 (0x7fe5fc8b4004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #54: + 0x211239 (0x561a98c1b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #55: PyObject_Call + 0x207 (0x561a98b5b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561a98b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #57: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #20: + 0x47def4 (0x7fa1220d8ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank25]: frame #21: + 0x1445a6 (0x55840b3cb5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #18: + 0x5af36b5 (0x7fe5fc8b96b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #19: + 0xd2631e (0x7fe60f4a331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank33]: frame #20: + 0x47def4 (0x7fe60ebfaef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561a98b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #59: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #60: PyObject_Call + 0xbc (0x561a98b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #19: + 0xd2631e (0x7f3c5aece31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank33]: frame #21: + 0x1445a6 (0x56507cd2e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56507cd27a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #23: + 0x150866 (0x56507cd3a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561a98b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #62: + 0x150582 (0x561a98b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #63: PyObject_Call + 0xbc (0x561a98b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank24]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f343eb19610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56507cd23142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56507cd2ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #26: PyObject_Call + 0xbc (0x56507cd3af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #16: + 0x1acf4b8 (0x7f13591414b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55840b3c4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a15114cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56507cd212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56507cd2ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56507cd1f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #30: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56507cd1f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #32: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56507cd1f8fa in /fsx/ferdinandmom/miniforge3/envs[default2]:[rank26]: frame #20: + 0x47def4 (0x7f3c5a625ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank48]: frame #54: + 0x211239 (0x55ca21e43239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #55: PyObject_Call + 0x207 (0x55ca21d83067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) /env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b743e2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55ca21d692b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #57: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55ca21d678fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #34: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56507cd1f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56507cd26f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56507cd38c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #38: + 0x211239 (0x56507cdfb239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56507cd27a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56507cd233e6 in /fsx/ferdinandmom/miniforge3/envs/en[default0]:[rank24]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f343eb38978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #59: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #54: + 0x211239 (0x55a15120f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) v-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56507cd2ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: Traceback (most recent call last): [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank27]: frame #17: + 0x5aee004 (0x7f135d160004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #18: + 0x5af36b5 (0x7f135d1656b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #55: PyObject_Call + 0x207 (0x55a15114f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56507cd1ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56507cd2ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56507cd1f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: trainer.train(dataloader) [default5]:[rank61]: Traceback (most recent call last): [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank27]: frame #19: + 0xd2631e (0x7f136fd4f31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank24]: frame #12: + 0x5adc309 (0x7f34772f7309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1511352b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #45: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #46: PyObject_Call + 0xbc (0x56507cd3af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56507cd212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #48: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank61]: trainer.train(dataloader) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank61]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank26]: frame #21: + 0x1445a6 (0x5585de82d5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #57: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #49: PyObject_Call + 0xbc (0x56507cd3af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56507cd212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56507cd2ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank26]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5585de826a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #60: PyObject_Call + 0xbc (0x55ca21d82f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56507cd27007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56507cd38c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #54: + 0x211239 (0x56507cdfb239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank27]: frame #20: + 0x47def4 (0x7f136f4a6ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank55]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a1511338fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #59: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #55: PyObject_Call + 0x207 (0x56507cd3b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56507cd212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #57: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56507cd1f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank63]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank24]: frame #13: + 0x5ae6f10 (0x7f3477301f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #60: PyObject_Call + 0xbc (0x55a15114ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55ca21d692b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #59: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #60: PyObject_Call + 0xbc (0x56507cd3af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56507cd212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank28]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b743e2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #62: + 0x150582 (0x55ca21d82582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #62: + 0x150582 (0x56507cd3a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #63: PyObject_Call + 0xbc (0x56507cd3af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: output = model(**micro_batch) [default3]:[rank27]: frame #21: + 0x1445a6 (0x55c39b55b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #63: PyObject_Call + 0xbc (0x55ca21d82f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c39b554a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #23: + 0x150866 (0x55c39b567866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1511352b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c39b550142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #23: + 0x150866 (0x5585de839866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #62: + 0x150582 (0x55a15114e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #63: PyObject_Call + 0xbc (0x55a15114ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank24]: frame #14: + 0x5ae6fa5 (0x7f3477301fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank63]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank24]: frame #15: + 0x5124446 (0x7f347693f446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c39b55ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank63]: output = model(**micro_batch) [default2]:[rank26]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5585de822142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5585de82da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f5b3bbef189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: frame #26: PyObject_Call + 0xbc (0x55c39b567f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return forward_call(*args, **kwargs) [default1]:[rank25]: frame #23: + 0x150866 (0x55840b3d7866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: return forward_call(*args, **kwargs) [default0]:[rank24]: frame #16: + 0x1acf4b8 (0x7f34732ea4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: sharded_logits = self.model( [default4]:[rank28]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f5b3bbf6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank28]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f5b3bc15978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: frame #17: + 0x5aee004 (0x7f3477309004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: Traceback (most recent call last): [default5]:[rank61]: sharded_logits = self.model( [default2]:[rank26]: frame #26: PyObject_Call + 0xbc (0x5585de839f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55840b3c0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c39b54e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: frame #18: + 0x5af36b5 (0x7f347730e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: frame #12: + 0x5adc309 (0x7f5b743d4309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c39b55ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c39b54c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55840b3cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank24]: frame #19: + 0xd2631e (0x7f3489ef831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank63]: return forward_call(*args, **kwargs) [default0]:[rank24]: frame #20: + 0x47def4 (0x7f348964fef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank27]: frame #30: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5585de8202b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank50]: trainer.train(dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank50]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank50]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank61]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank24]: frame #21: + 0x1445a6 (0x5629c9fb45a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank50]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank27]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c39b54c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank50]: output = model(**micro_batch) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank24]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5629c9fada6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5585de82da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank50]: sharded_logits = self.model( [default7]:[rank63]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank25]: frame #26: PyObject_Call + 0xbc (0x55840b3d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default5]:[rank61]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank28]: frame #13: + 0x5ae6f10 (0x7f5b743def10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank50]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5585de81e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank25]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55840b3be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank50]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: frame #23: + 0x150866 (0x5629c9fc0866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5629c9fa9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank50]: pipeline_state.run_communication() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank50]: recv_activation_tensor = recv_activation() [default7]:[rank63]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank25]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55840b3cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank50]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: frame #30: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank50]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: Traceback (most recent call last): [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank17]: trainer.train(dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank17]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank17]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank17]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default3]:[rank27]: frame #32: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: dist.recv( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank25]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55840b3bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #14: + 0x5ae6fa5 (0x7f5b743defa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: return func(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank50]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank61]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank24]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5629c9fb4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default2]:[rank50]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank61]: pipeline_state.run_communication() [default1]:[rank25]: frame #30: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb6296c8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank50]: frame #1: + 0x5b3a23e (0x7fb6631e523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank61]: recv_activation_tensor = recv_activation() ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank17]: output = model(**micro_batch) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c39b54c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5585de81e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fb6631dfc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb6631dff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank17]: sharded_logits = self.model( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[default4]:[rank28]: frame #15: + 0x5124446 (0x7f5b73a1c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb6631e0fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb663195371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb663195371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb663195371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb663195371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libt[default7]:[rank63]: return self._call_impl(*args, **kwargs) [rank17]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank17]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank17]: new_kwargs[name] = recv_from_pipeline_s[default3]:[rank27]: frame #34: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) orch_cpu.so) [default2]:[rank50]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb62a9a2189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fb62a9a9610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fb62a9c8978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #12: + 0x5adc309 (0x7fb663187309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #13: + 0x5ae6f10 (0x7fb663191f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl ult1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank17]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank26]: frame #32: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #14: + 0x5ae6fa5 (0x7fb663191fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #15: + 0x5124446 (0x7fb6627cf446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank17]: dist.recv( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank17]: return func(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank17]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank17]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank17]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank17]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa688a63897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10[default4]:[rank28]: frame #16: + 0x1acf4b8 (0x7f5b703c74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #16: + 0x1acf4b8 (0x7fb65f17a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #17: + 0x5aee004 (0x7fb663199004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #18: + 0x5af36b5 (0x7fb66319e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank63]: return forward_call(*args, **kwargs) /site-packages/torch/lib/libc10.so) [default1]:[rank17]: frame #1: + 0x5b3a23e (0x7fa6c258023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa6c257ac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa6c257af82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa6c257bfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c39b54c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #19: + 0xd2631e (0x7fb675d8831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank50]: frame #20: + 0x47def4 (0x7fb6754dfef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank50]: frame #21: + 0x1445a6 (0x55eb7db355a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55eb7db2ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank17]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa6c2530371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa6c2530371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa6c2530371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa6c2530371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa689d3d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench[default0]:[rank24]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5629c9fa72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5629c9fb4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #23: + 0x150866 (0x55eb7db41866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55eb7db2a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55eb7db35a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #26: PyObject_Call + 0xbc (0x55eb7db41f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors -cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank26]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5585de81e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb7db282b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55eb7db35a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55eb7db268fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #30: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank17]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa689d44610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank17]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa689d63978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank17]: frame #12: + 0x5adc309 (0x7fa6c2522309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #13: + 0x5ae6f10 (0x7fa6c252cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: frame #17: + 0x5aee004 (0x7f5b743e6004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55eb7db268fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #32: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55eb7db268fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #34: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank17]: frame #14: + 0x5ae6fa5 (0x7fa6c252cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c39b553f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5629c9fa58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55eb7db268fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55eb7db2df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55eb7db3fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank17]: frame #15: + 0x5124446 (0x7fa6c1b6a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #16: + 0x1acf4b8 (0x7fa6be5154b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #17: + 0x5aee004 (0x7fa6c2534004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #32: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #18: + 0x5af36b5 (0x7f5b743eb6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #38: + 0x211239 (0x55eb7dc02239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55eb7db2ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55eb7db2a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55eb7db35a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank17]: frame #18: + 0x5af36b5 (0x7fa6c25396b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #19: + 0xd2631e (0x7fa6d512331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank26]: frame #34: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #30: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5629c9fa58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55eb7db25c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55eb7db35a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55eb7db268fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #45: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: pipeline_state.run_communication() [default1]:[rank17]: frame #20: + 0x47def4 (0x7fa6d487aef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank17]: frame #21: + 0x1445a6 (0x5638e6d475a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #19: + 0xd2631e (0x7f5b86fd531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank50]: frame #46: PyObject_Call + 0xbc (0x55eb7db41f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb7db282b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #48: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #49: PyObject_Call + 0xbc (0x55eb7db41f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank63]: recv_activation_tensor = recv_activation() [default1]:[rank17]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5638e6d40a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #23: + 0x150866 (0x5638e6d53866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5638e6d3c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5638e6d47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #20: + 0x47def4 (0x7f5b8672cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank50]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb7db282b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55eb7db35a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55eb7db2e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55eb7db3fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: Traceback (most recent call last): [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank47]: trainer.train(dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank47]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank47]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank47]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default5]:[rank61]: dist.recv( [default1]:[rank17]: frame #26: PyObject_Call + 0xbc (0x5638e6d53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5638e6d3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5638e6d47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c39b565c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #54: + 0x211239 (0x55eb7dc02239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #55: PyObject_Call + 0x207 (0x55eb7db42067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb7db282b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #57: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank47]: output = model(**micro_batch) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank47]: sharded_logits = self.model( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank63]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank17]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5638e6d388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #30: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5638e6d388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #32: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5638e6d388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #34: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5638e6d388fa in /fsx/ferdinandmom/miniforge3/envs[default1]:[rank25]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55840b3bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55eb7db268fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #59: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #60: PyObject_Call + 0xbc (0x55eb7db41f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb7db282b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #62: + 0x150582 (0x55eb7db41582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #63: PyObject_Call + 0xbc (0x55eb7db41f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank47]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank47]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors /env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5638e6d3ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5585de81e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5585de825f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank47]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank47]: pipeline_state.run_communication() [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank47]: recv_activation_tensor = recv_activation() [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [defaul[default5]:[rank61]: return func(*args, **kwargs) [default1]:[rank17]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5638e6d51c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #38: + 0x211239 (0x5638e6e14239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5638e6d40a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5638e6d3c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5638e6d47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5638e6d37c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5638e6d47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench[default1]:[rank25]: frame #34: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) t7]:[rank47]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank47]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank47]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank61]: pg.recv([tensor], group_src_rank, tag).wait() -cluster/bin/python3.10) [default1]:[rank17]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5638e6d388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #45: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #46: PyObject_Call + 0xbc (0x5638e6d53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5638e6d3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55840b3bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #32: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #21: + 0x1445a6 (0x55c5f24e45a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank47]: dist.recv( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank47]: return func(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank47]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank47]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default7]:[rank47]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank47][default7]:[rank63]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank17]: frame #48: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #49: PyObject_Call + 0xbc (0x5638e6d53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5638e6d3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5638e6d47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5638e6d40007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55840b3c3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) : frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef515db897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank47]: frame #1: + 0x5b3a23e (0x7fef8b0f823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fef8b0f2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fef8b0f2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fef8b0f3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/si[default5]:[rank61]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default5]:[rank61]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank17]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5638e6d51c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #54: + 0x211239 (0x5638e6e14239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #55: PyObject_Call + 0x207 (0x5638e6d54067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5638e6d3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5585de837c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) te-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef8b0a8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef8b0a8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef8b0a8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef8b0a8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fef528b5189 in /fsx[default5]:[rank61]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f35695d4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank17]: frame #57: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5638e6d388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #59: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c5f24dda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) /ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fef528bc610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fef528db978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #12: + 0x5adc309 (0x7fef8b09a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #13: + 0x5ae6f10 (0x7fef8b0a4f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank17]: frame #60: PyObject_Call + 0xbc (0x5638e6d53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5638e6d3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: frame #62: + 0x150582 (0x5638e6d53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #23: + 0x150866 (0x55c5f24f0866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55840b3d5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #38: + 0x211239 (0x55840b498239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #14: + 0x5ae6fa5 (0x7fef8b0a4fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #1: + 0x5b3a23e (0x7f35a30f123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank17]: frame #63: PyObject_Call + 0xbc (0x5638e6d53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank17]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank24]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5629c9fa58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #15: + 0x5124446 (0x7fef8a6e2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #16: + 0x1acf4b8 (0x7fef8708d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #17: + 0x5aee004 (0x7fef8b0ac004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #18: + 0x5af36b5 (0x7fef8b0b16b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #19: + 0xd2631e (0x7fef9dc9b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank63]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank20]: Traceback (most recent call last): [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank20]: trainer.train(dataloader) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank20]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank20]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank20]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default2]:[rank26]: frame #38: + 0x211239 (0x5585de8fa239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #20: + 0x47def4 (0x7fef9d3f2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank47]: frame #21: + 0x1445a6 (0x56469054d5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #22: _PyObject_MakeTpCall + 0x26b (0x564690546a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank61]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f35a30ebc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank20]: output = model(**micro_batch) [default3]:[rank27]: frame #38: + 0x211239 (0x55c39b628239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c5f24d9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #23: + 0x150866 (0x564690559866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x564690542142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56469054da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: dist.recv( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank20]: sharded_logits = self.model( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.p[default0]:[rank24]: frame #34: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #26: PyObject_Call + 0xbc (0x564690559f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5646905402b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper y", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank20]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank20]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55840b3c4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5585de826a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56469054da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56469053e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #30: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56469053e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return func(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank20]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank24]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5629c9fa58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #32: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56469053e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #34: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank20]: pipeline_state.run_communication() [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank20]: recv_activation_tensor = recv_activation() [default4]:[rank28]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c5f24e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5585de8223e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56469053e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x564690545f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f35a30ebf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank20]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank20]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank27]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c39b554a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c39b5503e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #37: _PyObject_Call_Prepend + 0x69 (0x564690557c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #38: + 0x211239 (0x56469061a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #39: _PyObject_MakeTpCall + 0x26b (0x564690546a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5646905423e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56469054da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f35a30ecfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank20]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank25]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55840b3c03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56469053dc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56469054da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56469053e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #45: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #46: PyObject_Call + 0xbc (0x564690559f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5646905402b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #48: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-clu[default7]:[rank63]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank20]: dist.recv( [default0]:[rank24]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5629c9facf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5585de82da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ster/bin/python3.10) [default7]:[rank47]: frame #49: PyObject_Call + 0xbc (0x564690559f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd326219897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank61]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35a30a1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank20]: return func(*args, **kwargs) [default4]:[rank28]: frame #26: PyObject_Call + 0xbc (0x55c5f24f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5646905402b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56469054da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x564690546007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #53: _PyObject_Call_Prepend + 0x69 (0x564690557c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35a30a1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35a30a1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35a30a1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #1: + 0x5b3a23e (0x7fd35fd3623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank20]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank24]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5629c9fbec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #38: + 0x211239 (0x5629ca081239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #54: + 0x211239 (0x56469061a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #55: PyObject_Call + 0x207 (0x56469055a067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5646905402b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #57: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56469053e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #59: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fd35fd30c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd35fd30f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd35fd31fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank28]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5f24d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #60: PyObject_Call + 0xbc (0x564690559f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5646905402b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #62: + 0x150582 (0x564690559582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #63: PyObject_Call + 0xbc (0x564690559f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd35fce6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank27]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c39b55ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5585de81dc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank63]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd35fce6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd35fce6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd35fce6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05378a7897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank20]: frame #1: + 0x5b3a23e (0x7f05713c423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55840b3cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f356a8ae189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank20]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f05713bec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f05713bef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank25]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55840b3bbc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd3274f3189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank20]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f05713bffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0571374371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0571374371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c5f24e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f356a8b5610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank20]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0571374371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0571374371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank24]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5629c9fada6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f356a8d4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fd3274fa610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fd327519978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #12: + 0x5adc309 (0x7fd35fcd8309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0538b81189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank20]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0538b88610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank20]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0538ba7978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank25]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55840b3cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: Traceback (most recent call last): [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank53]: trainer.train(dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank53]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank36]: Traceback (most recent call last): [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank36]: trainer.train(dataloader) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank36]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank36]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank63]: frame #13: + 0x5ae6f10 (0x7fd35fce2f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #12: + 0x5adc309 (0x7f0571366309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #13: + 0x5ae6f10 (0x7f0571370f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank28]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c5f24d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #12: + 0x5adc309 (0x7f35a3093309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #14: + 0x5ae6fa5 (0x7f0571370fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #15: + 0x5124446 (0x7f05709ae446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #16: + 0x1acf4b8 (0x7f056d3594b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank27]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c39b54bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #13: + 0x5ae6f10 (0x7f35a309df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #17: + 0x5aee004 (0x7f0571378004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #18: + 0x5af36b5 (0x7f057137d6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #19: + 0xd2631e (0x7f0583f6731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank26]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5585de82da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55840b3bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank53]: output = model(**micro_batch) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank53]: sharded_logits = self.model( [default4]:[rank36]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank36]: output = model(**micro_batch) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return forward_call(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: frame #14: + 0x5ae6fa5 (0x7fd35fce2fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #20: + 0x47def4 (0x7f05836beef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank20]: frame #21: + 0x1445a6 (0x55c7f33bd5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #30: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank53]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank53]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-[default4]:[rank36]: sharded_logits = self.model( [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return forward_call(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank36]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank36]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank36][default7]:[rank63]: frame #15: + 0x5124446 (0x7fd35f320446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c7f33b6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #23: + 0x150866 (0x55c7f33c9866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c7f33b2142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c5f24d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c39b55ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl : File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return forward_call(*args, **kwargs) [default7]:[rank63]: frame #16: + 0x1acf4b8 (0x7fd35bccb4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #17: + 0x5aee004 (0x7fd35fcea004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #18: + 0x5af36b5 (0x7fd35fcef6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c7f33bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #26: PyObject_Call + 0xbc (0x55c7f33c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #45: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c39b54c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank53]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank53]: pipeline_state.run_communication() [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank36]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank36]: pipeline_state.run_communication() [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank36]: recv_activation_tensor = recv_activation() [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank36]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/be[default7]:[rank63]: frame #19: + 0xd2631e (0x7fd3728d931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank20]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c7f33b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c7f33bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c7f33ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #30: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: Traceback (most recent call last): [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5585de81e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5629c9fa93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5629c9fb4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank53]: recv_activation_tensor = recv_activation() [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank53]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] nch_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank36]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank36]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank36]: dist.recv( [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank36]: return func(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, i[default7]:[rank63]: frame #20: + 0x47def4 (0x7fd372030ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank20]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c7f33ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #32: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c7f33ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: trainer.train(dataloader) [default3]:[rank27]: frame #45: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #45: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank53]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors n recv [default4]:[rank36]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank61]: frame #14: + 0x5ae6fa5 (0x7f35a309dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #34: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c7f33ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #46: PyObject_Call + 0xbc (0x55840b3d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank53]: dist.recv( [default4]:[rank36]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default4]:[rank36]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank36]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f64e3f64897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank63]: frame #21: + 0x1445a6 (0x5558aafba5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c7f33b5f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c7f33c7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #38: + 0x211239 (0x55c7f348a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c7f33b6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #46: PyObject_Call + 0xbc (0x55c39b567f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank53]: return func(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank36]: frame #1: + 0x5b3a23e (0x7f651da8123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #15: + 0x5124446 (0x7f35a26db446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c7f33b23e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c7f33bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c7f33adc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #46: PyObject_Call + 0xbc (0x5585de839f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5585de8202b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5629c9fa4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c39b54e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank53]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default4]:[rank36]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f651da7bc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f651da7bf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f651da7cfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5558aafb3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c7f33bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c7f33ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #45: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #46: PyObject_Call + 0xbc (0x55c7f33c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c7f33b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #48: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #49: PyObject_Call + 0xbc (0x55c7f33c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/pyt[default4]:[rank28]: frame #32: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank53]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f70d9840897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank36]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f651da31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f651da31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f651da31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f651da31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #16: + 0x1acf4b8 (0x7f359f0864b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) hon3.10) [default4]:[rank20]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c7f33b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #48: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #1: + 0x5b3a23e (0x7f711335d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f7113357c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7113357f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f64e523e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f64e5245610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: frame #17: + 0x5aee004 (0x7f35a30a5004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank20]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c7f33bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c7f33b6007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c7f33c7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #54: + 0x211239 (0x55c7f348a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #55: PyObject_Call + 0x207 (0x55c7f33ca067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #49: PyObject_Call + 0xbc (0x5585de839f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7113358fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f711330d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f711330d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f64e5264978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: frame #12: + 0x5adc309 (0x7f651da23309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #13: + 0x5ae6f10 (0x7f651da2df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #18: + 0x5af36b5 (0x7f35a30aa6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #19: + 0xd2631e (0x7f35b5c9431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank20]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c7f33b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #57: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f711330d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f711330d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f70dab1a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank53]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f70dab21610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: frame #14: + 0x5ae6fa5 (0x7f651da2dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #15: + 0x5124446 (0x7f651d06b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #16: + 0x1acf4b8 (0x7f6519a164b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #17: + 0x5aee004 (0x7f651da35004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #18: + 0x5af36b5 (0x7f651da3a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #19: + 0xd2631e (0x7f6530624[default5]:[rank61]: frame #20: + 0x47def4 (0x7f35b53ebef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank61]: frame #21: + 0x1445a6 (0x55f3450105a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c7f33ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #59: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #60: PyObject_Call + 0xbc (0x55c7f33c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5629c9fb4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55840b3be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c5f24d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f70dab40978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank53]: frame #12: + 0x5adc309 (0x7f71132ff309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #13: + 0x5ae6f10 (0x7f7113309f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank36]: frame #20: + 0x47def4 (0x7f652fd7bef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank61]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55f345009a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c7f33b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #62: + 0x150582 (0x55c7f33c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: frame #63: PyObject_Call + 0xbc (0x55c7f33c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank53]: frame #14: + 0x5ae6fa5 (0x7f7113309fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #15: + 0x5124446 (0x7f7112947446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #16: + 0x1acf4b8 (0x7f710f2f24b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #17: + 0x5aee004 (0x7f7113311004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #18: + 0x5af36b5 (0x7f71133166b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #21: + 0x1445a6 (0x55b022ded5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b022de6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #23: + 0x150866 (0x55b022df9866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b022de2142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b022deda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #23: + 0x150866 (0x5558aafc6866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank20]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank25]: frame #48: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #34: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5629c9fa58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #19: + 0xd2631e (0x7f7125f0031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank53]: frame #20: + 0x47def4 (0x7f7125657ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank53]: frame #21: + 0x1445a6 (0x55c49d3825a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #26: PyObject_Call + 0xbc (0x55b022df9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b022de02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b022deda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b022dde8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #30: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b022dde8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #32: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cl[default7]:[rank63]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5558aafaf142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #23: + 0x150866 (0x55f34501c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: Traceback (most recent call last): [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank24]: frame #45: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #48: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c49d37ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #23: + 0x150866 (0x55c49d38e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c49d377142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c49d382a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #26: PyObject_Call + 0xbc (0x55c49d38ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c49d3752b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c49d382a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-clusteruster/bin/python3.10) [default4]:[rank36]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b022dde8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5558aafbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: trainer.train(dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank31]: outputs = self.pipeline_engine.train_batch_iter( /bin/python3.10) [default5]:[rank53]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c49d3738fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #30: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #34: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b022dde8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b022de5f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b022df7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #38: + 0x211239 (0x55b022eba239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b022de6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b022de23e6 in /fsx/ferdinandmom/miniforge3/envs/en[default5]:[rank61]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f345005142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank24]: frame #46: PyObject_Call + 0xbc (0x5629c9fc0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c49d3738fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #32: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c49d3738fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #34: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c49d3738fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c49d37af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c49d38cc39 in /fsx/ferdinandmom/miniforge3/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b022deda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b022dddc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b022deda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b022dde8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #45: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #26: PyObject_Call + 0xbc (0x5558aafc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank28]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c5f24d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) vs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #46: PyObject_Call + 0xbc (0x55b022df9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b022de02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #48: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #49: PyObject_Call + 0xbc (0x55b022df9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b022de02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b022deda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b022de6007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/[default5]:[rank61]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55f345010a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: Traceback (most recent call last): [default2]:[rank26]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5585de8202b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #38: + 0x211239 (0x55c49d44f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c49d37ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c49d3773e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c49d382a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c49d372c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c49d382a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) bin/python3.10) [default4]:[rank36]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b022df7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #26: PyObject_Call + 0xbc (0x55f34501cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5558aafad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank27]: frame #49: PyObject_Call + 0xbc (0x55c39b567f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c39b54e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c39b55ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c49d3738fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #45: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #46: PyObject_Call + 0xbc (0x55c49d38ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c49d3752b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #54: + 0x211239 (0x55b022eba239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #55: PyObject_Call + 0x207 (0x55b022dfa067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b022de02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #57: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b022dde8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f3450032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: Traceback (most recent call last): [default2]:[rank26]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5585de82da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank31]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank53]: frame #48: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #49: PyObject_Call + 0xbc (0x55c49d38ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c49d3752b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c49d382a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #59: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #60: PyObject_Call + 0xbc (0x55b022df9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b022de02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #62: + 0x150582 (0x55b022df9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5558aafbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank24]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5629c9fa72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #48: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c49d37b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c49d38cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #54: + 0x211239 (0x55c49d44f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #55: PyObject_Call + 0x207 (0x55c49d38f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #63: PyObject_Call + 0xbc (0x55b022df9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank61]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55f345010a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f3450018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: trainer.train(dataloader) [default3]:[rank27]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c39b554007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c5f24dcf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c49d3752b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #57: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c49d3738fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #59: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #30: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f3450018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank26]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5585de826007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #60: PyObject_Call + 0xbc (0x55c49d38ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c49d3752b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #62: + 0x150582 (0x55c49d38e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #63: PyObject_Call + 0xbc (0x55c49d38ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #32: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f3450018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank27]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c39b565c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: Traceback (most recent call last): [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank40]: trainer.train(dataloader) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank40]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank40]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank61]: frame #34: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f3450018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank28]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c5f24eec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f345008f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank22]: trainer.train(dataloader) [default2]:[rank26]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5585de837c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank31]: output = model(**micro_batch) [default5]:[rank61]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55f34501ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5558aafab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank24]: frame #49: PyObject_Call + 0xbc (0x5629c9fc0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #38: + 0x211239 (0x55f3450dd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank24]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5629c9fa72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank40]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank40]: output = model(**micro_batch) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: return forward_call(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_clust[default5]:[rank61]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55f345009a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank28]: frame #38: + 0x211239 (0x55c5f25b1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl er/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank40]: sharded_logits = self.model( [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: return forward_call(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank40]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank63]: frame #30: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank25]: frame #49: PyObject_Call + 0xbc (0x55840b3d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5629c9fb4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5629c9fad007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank63]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5558aafab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #32: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank24]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5629c9fbec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank44]: Traceback (most recent call last): [default5]:[rank61]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f3450053e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank28]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c5f24dda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5558aafab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank25]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55840b3be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55840b3cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #34: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5558aafab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank22]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank21]: output = model(**micro_batch) [default2]:[rank18]: output = model(**micro_batch) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5558aafb2f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55f345010a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: frame #54: + 0x211239 (0x5585de8fa239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5558aafc4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: frame #55: PyObject_Call + 0x207 (0x5585de83a067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #54: + 0x211239 (0x5629ca081239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank61]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f345000c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55f345010a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: frame #54: + 0x211239 (0x55c39b628239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f3450018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5585de8202b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: trainer.train(dataloader) [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: frame #45: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #46: PyObject_Call + 0xbc (0x55f34501cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: return forward_call(*args, **kwargs) [default1]:[rank25]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55840b3c4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank44]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: return forward_call(*args, **kwargs) [default7]:[rank63]: frame #38: + 0x211239 (0x5558ab087239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5558aafb3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank25]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55840b3d5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5558aafaf3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: frame #55: PyObject_Call + 0x207 (0x55c39b568067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f3450032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #48: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank24]: frame #55: PyObject_Call + 0x207 (0x5629c9fc1067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank61]: frame #49: PyObject_Call + 0xbc (0x55f34501cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f3450032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55f345010a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: return forward_call(*args, **kwargs) [default3]:[rank27]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c39b54e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c5f24d93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5558aafbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank25]: frame #54: + 0x211239 (0x55840b498239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank40]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank63]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5558aafaac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f345009007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank24]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5629c9fa72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank63]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5558aafbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: sharded_logits = self.model( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank22]: output = model(**micro_batch) [default4]:[rank28]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c5f24e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank40]: pipeline_state.run_communication() [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank40]: recv_activation_tensor = recv_activation() [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank63]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5558aafab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #45: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: sharded_logits = self.model( [default7]:[rank31]: sharded_logits = self.model( [default0]:[rank40]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank63]: frame #46: PyObject_Call + 0xbc (0x5558aafc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: frame #57: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5585de81e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55f34501ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c5f24d4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank63]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5558aafad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank25]: frame #55: PyObject_Call + 0x207 (0x55840b3d8067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55840b3be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank61]: frame #54: + 0x211239 (0x55f3450dd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank44]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank61]: frame #55: PyObject_Call + 0x207 (0x55f34501d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #48: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #49: PyObject_Call + 0xbc (0x5558aafc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: frame #59: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank42]: Traceback (most recent call last): [default7]:[rank63]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5558aafad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: frame #60: PyObject_Call + 0xbc (0x5585de839f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank63]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5558aafbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c5f24e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c5f24d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: dist.recv( [default7]:[rank63]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5558aafb3007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5558aafc4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5585de8202b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank63]: frame #54: + 0x211239 (0x5558ab087239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #55: PyObject_Call + 0x207 (0x5558aafc7067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: return forward_call(*args, **kwargs) [default0]:[rank24]: frame #57: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5629c9fa58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #59: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f3450032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: return forward_call(*args, **kwargs) [default1]:[rank25]: frame #57: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55840b3bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #57: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank25]: frame #59: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank26]: frame #62: + 0x150582 (0x5585de839582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5558aafad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank25]: frame #60: PyObject_Call + 0xbc (0x55840b3d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank42]: trainer.train(dataloader) [default7]:[rank63]: frame #57: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5558aafab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: sharded_logits = self.model( [default1]:[rank25]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55840b3be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #45: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #60: PyObject_Call + 0xbc (0x5629c9fc0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank24]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5629c9fa72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: Traceback (most recent call last): [default4]:[rank44]: output = model(**micro_batch) [default7]:[rank63]: frame #59: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f3450018fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank25]: frame #62: + 0x150582 (0x55840b3d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank25]: frame #63: PyObject_Call + 0xbc (0x55840b3d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank63]: frame #60: PyObject_Call + 0xbc (0x5558aafc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5558aafad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: frame #57: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank63]: frame #62: + 0x150582 (0x5558aafc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #59: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank18]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank28]: frame #46: PyObject_Call + 0xbc (0x55c5f24f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: frame #63: PyObject_Call + 0xbc (0x5558aafc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank25]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank24]: frame #62: + 0x150582 (0x5629c9fc0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c39b54c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank61]: frame #60: PyObject_Call + 0xbc (0x55f34501cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f3450032b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: frame #59: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: Traceback (most recent call last): [default2]:[rank42]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank61]: frame #62: + 0x150582 (0x55f34501c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #63: PyObject_Call + 0xbc (0x55f34501cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank37]: Traceback (most recent call last): [default0]:[rank40]: return func(*args, **kwargs) [default5]:[rank61]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank22]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank26]: frame #63: PyObject_Call + 0xbc (0x5585de839f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank62]: Traceback (most recent call last): [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank62]: trainer.train(dataloader) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank62]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank62]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank26]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank37]: trainer.train(dataloader) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank42]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank62]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank18]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank31]: return forward_call(*args, **kwargs) [default4]:[rank44]: return forward_call(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank22]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5f24d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #48: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank58]: Traceback (most recent call last): [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank62]: output = model(**micro_batch) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank40]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank58]: trainer.train(dataloader) [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: return forward_call(*args, **kwargs) [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: frame #60: PyObject_Call + 0xbc (0x55c39b567f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: output = model(**micro_batch) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank62]: sharded_logits = self.model( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank27]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c39b54e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank27]: frame #62: + 0x150582 (0x55c39b567582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: trainer.train(dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: Traceback (most recent call last): [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: frame #63: PyObject_Call + 0xbc (0x5629c9fc0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank35]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: trainer.train(dataloader) [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: return forward_call(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank22]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank24]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank37]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank44]: sharded_logits = self.model( [default0]:[rank40]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank62]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank27]: frame #63: PyObject_Call + 0xbc (0x55c39b567f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe9ebcfa897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank62]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: return forward_call(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank21]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank28]: frame #49: PyObject_Call + 0xbc (0x55c5f24f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank31]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank28]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5f24d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #1: + 0x5b3a23e (0x7fea2581723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank28]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c5f24e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank42]: return forward_call(*args, **kwargs) [default2]:[rank58]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank18]: return forward_call(*args, **kwargs) [default4]:[rank28]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c5f24dd007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank35]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank32]: trainer.train(dataloader) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c5f24eec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #54: + 0x211239 (0x55c5f25b1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank40]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fea25811c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank21]: pipeline_state.run_communication() [default4]:[rank28]: frame #55: PyObject_Call + 0x207 (0x55c5f24f1067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5f24d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank34]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank28]: frame #57: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c5f24d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank40]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fea25811f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: pipeline_state.run_communication() [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank62]: recv_activation_tensor = recv_activation() [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank21]: recv_activation_tensor = recv_activation() [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: sharded_logits = self.model( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank62]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank18]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank28]: frame #59: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank40]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fea25812fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank62]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea257c7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea257c7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank62]: dist.recv( [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank28]: frame #60: PyObject_Call + 0xbc (0x55c5f24f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c5f24d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank28]: frame #62: + 0x150582 (0x55c5f24f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank34]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank40]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea257c7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: return func(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank62]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank22]: return forward_call(*args, **kwargs) [default4]:[rank28]: frame #63: PyObject_Call + 0xbc (0x55c5f24f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: return forward_call(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank62]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default5]:[rank21]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank18]: pipeline_state.run_communication() [default4]:[rank28]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank31]: return forward_call(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank62]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank62]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46753a2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank31]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank37]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank37]: output = model(**micro_batch) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fea257c7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank31]: pipeline_state.run_communication() [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank32]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank40]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe9ecfd4189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fe9ecfdb610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fe9ecffa978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: frame #1: + 0x5b3a23e (0x7f46aeebf23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: recv_activation_tensor = recv_activation() [default7]:[rank31]: recv_activation_tensor = recv_activation() [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank34]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank42]: return forward_call(*args, **kwargs) [default2]:[rank58]: output = model(**micro_batch) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank31]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank31]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank31]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank31]: dist.recv( [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank42]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank16]: Traceback (most recent call last): [default7]:[rank31]: return func(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank31]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank22]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank31]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank40]: frame #12: + 0x5adc309 (0x7fea257b9309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank21]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank31]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank31]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fde7cf29897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank31]: frame #1: + 0x5b3a23e (0x7fdeb6a4623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank58]: sharded_logits = self.model( [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank31]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fdeb6a40c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fdeb6a40f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: output = model(**micro_batch) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank32]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: frame #13: + 0x5ae6f10 (0x7fea257c3f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f46aeeb9c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f46aeeb9f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank31]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fdeb6a41fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdeb69f6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdeb69f6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: return forward_call(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank31]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdeb69f6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdeb69f6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fde7e203189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank31]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fde7e20a610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank34]: output = model(**micro_batch) [default2]:[rank42]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank22]: pipeline_state.run_communication() [default7]:[rank31]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fde7e229978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank31]: frame #12: + 0x5adc309 (0x7fdeb69e8309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #13: + 0x5ae6f10 (0x7fdeb69f2f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank32]: output = model(**micro_batch) [default0]:[rank40]: frame #14: + 0x5ae6fa5 (0x7fea257c3fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: return forward_call(*args, **kwargs) [default5]:[rank21]: dist.recv( [default7]:[rank31]: frame #14: + 0x5ae6fa5 (0x7fdeb69f2fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #15: + 0x5124446 (0x7fdeb6030446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #16: + 0x1acf4b8 (0x7fdeb29db4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: frame #15: + 0x5124446 (0x7fea24e01446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank31]: frame #17: + 0x5aee004 (0x7fdeb69fa004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #18: + 0x5af36b5 (0x7fdeb69ff6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank31]: frame #19: + 0xd2631e (0x7fdec95e931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: frame #16: + 0x1acf4b8 (0x7fea217ac4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank31]: frame #20: + 0x47def4 (0x7fdec8d40ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: frame #17: + 0x5aee004 (0x7fea257cb004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank22]: recv_activation_tensor = recv_activation() [default7]:[rank31]: frame #21: + 0x1445a6 (0x55d125a265a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55d125a1fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank58]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank21]: return func(*args, **kwargs) [default7]:[rank31]: frame #23: + 0x150866 (0x55d125a32866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d125a1b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55d125a26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank16]: trainer.train(dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank31]: frame #26: PyObject_Call + 0xbc (0x55d125a32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d125a192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55d125a26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d125a178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f46aeebafd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank31]: frame #30: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d125a178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #32: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank31]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d125a178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default2]:[rank18]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank31]: frame #34: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d125a178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d125a1ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46aee6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46aee6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank31]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55d125a30c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #38: + 0x211239 (0x55d125af3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55d125a1fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank32]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank58]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank21]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank31]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d125a1b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55d125a26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d125a16c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55d125a26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: return forward_call(*args, **kwargs) [default4]:[rank44]: pipeline_state.run_communication() [default0]:[rank40]: frame #18: + 0x5af36b5 (0x7fea257d06b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #19: + 0xd2631e (0x7fea383ba31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank16]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank31]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d125a178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #45: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #46: PyObject_Call + 0xbc (0x55d125a32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d125a192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank34]: return forward_call(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank45]: Traceback (most recent call last): [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank45]: trainer.train(dataloader) [default2]:[rank58]: pipeline_state.run_communication() [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank18]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank31]: frame #48: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #49: PyObject_Call + 0xbc (0x55d125a32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d125a192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: return forward_call(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank40]: frame #20: + 0x47def4 (0x7fea37b11ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: recv_activation_tensor = recv_activation() [default2]:[rank58]: recv_activation_tensor = recv_activation() [default2]:[rank18]: dist.recv( [default0]:[rank16]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank31]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55d125a26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank35]: sharded_logits = self.model( [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: frame #21: + 0x1445a6 (0x55b773bf05a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46aee6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46aee6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank31]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d125a1f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55d125a30c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank62]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f467667c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank22]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank31]: frame #54: + 0x211239 (0x55d125af3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #55: PyObject_Call + 0x207 (0x55d125a33067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank31]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d125a192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #57: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: Traceback (most recent call last): [default2]:[rank58]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank21]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1dc1fed897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank31]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d125a178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank44]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank58]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank18]: return func(*args, **kwargs) [default6]:[rank30]: Traceback (most recent call last): [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank30]: trainer.train(dataloader) [default5]:[rank29]: Traceback (most recent call last): [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: sharded_logits = self.model( [default0]:[rank32]: sharded_logits = self.model( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b773be9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: trainer.train(dataloader) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank62]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4676683610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f46766a2978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank16]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank31]: frame #59: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #60: PyObject_Call + 0xbc (0x55d125a32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d125a192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: frame #23: + 0x150866 (0x55b773bfc866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank58]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank22]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank29]: trainer.train(dataloader) [default3]:[rank35]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank22]: dist.recv( [default7]:[rank31]: frame #62: + 0x150582 (0x55d125a32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank31]: frame #63: PyObject_Call + 0xbc (0x55d125a32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b773be5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b773bf0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #12: + 0x5adc309 (0x7f46aee61309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: output = model(**micro_batch) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank58]: dist.recv( [default5]:[rank21]: frame #1: + 0x5b3a23e (0x7f1dfbb0a23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank30]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank35]: return forward_call(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: return func(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank31]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank62]: frame #13: + 0x5ae6f10 (0x7f46aee6bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: return func(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank29]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank51]: Traceback (most recent call last): [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: return forward_call(*args, **kwargs) [default0]:[rank32]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank44]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank58]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank18]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank37]: sharded_logits = self.model( [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank62]: frame #14: + 0x5ae6fa5 (0x7f46aee6bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank41]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank62]: frame #15: + 0x5124446 (0x7f46ae4a9446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank29]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank42]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank58]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default5]:[rank21]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1dfbb04c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank54]: Traceback (most recent call last): [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank54]: trainer.train(dataloader) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank54]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank51]: trainer.train(dataloader) [default2]:[rank42]: pipeline_state.run_communication() [default0]:[rank40]: frame #26: PyObject_Call + 0xbc (0x55b773bfcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank18]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank22]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank29]: output = model(**micro_batch) [default6]:[rank54]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank54]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank54]: output = model(**micro_batch) [default4]:[rank44]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank62]: frame #16: + 0x1acf4b8 (0x7f46aae544b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #17: + 0x5aee004 (0x7f46aee73004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #18: + 0x5af36b5 (0x7f46aee786b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #19: + 0xd2631e (0x7f46c1a6231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank21]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1dfbb04f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: output = model(**micro_batch) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank62]: frame #20: + 0x47def4 (0x7f46c11b9ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank62]: frame #21: + 0x1445a6 (0x55b168f375a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b168f30a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1dfbb05fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank41]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank45]: output = model(**micro_batch) [default2]:[rank58]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5c39ff5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank58]: frame #1: + 0x5b3a23e (0x7f5c73b1223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank42]: recv_activation_tensor = recv_activation() [default2]:[rank58]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f5c73b0cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank51]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank35]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5c73b0cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: return forward_call(*args, **kwargs) [default6]:[rank30]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank51]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank42]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank58]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5c73b0dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #23: + 0x150866 (0x55b168f43866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank54]: return forward_call(*args, **kwargs) [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b168f2c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b168f37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f60f8d3e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank54]: sharded_logits = self.model( [default0]:[rank32]: return forward_call(*args, **kwargs) [default2]:[rank58]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5c73ac2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5c73ac2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5c73ac2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5c73ac2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank30]: sharded_logits = self.model( [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: frame #26: PyObject_Call + 0xbc (0x55b168f43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b168f2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b168f37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b168f288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f17fafae897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return forward_call(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank35]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank40]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b773be32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #30: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b168f288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #1: + 0x5b3a23e (0x7f613285b23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank51]: output = model(**micro_batch) [default0]:[rank32]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank40]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b773bf0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f5c3b2cf189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank16]: sharded_logits = self.model( [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank37]: return forward_call(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank37]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank45]: return forward_call(*args, **kwargs) [default4]:[rank44]: dist.recv( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f5c3b2d6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank22]: frame #1: + 0x5b3a23e (0x7f1834acb23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank58]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f5c3b2f5978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank58]: frame #12: + 0x5adc309 (0x7f5c73ab4309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f6132855c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6132855f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank32]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank58]: frame #13: + 0x5ae6f10 (0x7f5c73abef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1834ac5c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1dfbaba371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: sharded_logits = self.model( [default3]:[rank51]: sharded_logits = self.model( [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank54]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank62]: frame #32: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1dfbaba371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6132856fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b773be18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank62]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b168f288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1dfbaba371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank41]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank62]: frame #34: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b168f288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1dfbaba371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank32]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank62]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b168f2ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1dc32c7189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: frame #30: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b773be18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #32: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b168f41c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #38: + 0x211239 (0x55b169004239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1834ac5f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: sharded_logits = self.model( [default1]:[rank41]: output = model(**micro_batch) [default4]:[rank44]: return func(*args, **kwargs) [default6]:[rank62]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b168f30a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b168f2c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f1dc32ce610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank30]: return forward_call(*args, **kwargs) [default3]:[rank51]: return forward_call(*args, **kwargs) [default2]:[rank34]: return forward_call(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b168f37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b168f27c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1834ac6fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank29]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank54]: return forward_call(*args, **kwargs) [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank35]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank62]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b168f37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b168f288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f1dc32ed978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank21]: frame #12: + 0x5adc309 (0x7f1dfbaac309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #13: + 0x5ae6f10 (0x7f1dfbab6f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank32]: return forward_call(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank62]: frame #45: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f613280b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f613280b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank51]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: frame #46: PyObject_Call + 0xbc (0x55b168f43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b168f2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #48: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #49: PyObject_Call + 0xbc (0x55b168f43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b168f2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f613280b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: pipeline_state.run_communication() [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank54]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank35]: return forward_call(*args, **kwargs) [default4]:[rank44]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank62]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b168f37a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b168f30007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f613280b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: recv_activation_tensor = recv_activation() [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank32]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b168f41c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #54: + 0x211239 (0x55b169004239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1834a7b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank54]: pipeline_state.run_communication() [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank35]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank45]: return forward_call(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank62]: frame #55: PyObject_Call + 0x207 (0x55b168f44067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1834a7b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank54]: recv_activation_tensor = recv_activation() [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: return forward_call(*args, **kwargs) [default2]:[rank42]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank40]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b773be18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #14: + 0x5ae6fa5 (0x7f5c73abefa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #15: + 0x5124446 (0x7f5c730fc446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1834a7b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: return forward_call(*args, **kwargs) [default5]:[rank21]: frame #14: + 0x5ae6fa5 (0x7f1dfbab6fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank32]: pipeline_state.run_communication() [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank62]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b168f2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f60fa018189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank30]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank62]: frame #57: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #15: + 0x5124446 (0x7f1dfb0f4446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank35]: pipeline_state.run_communication() [default0]:[rank40]: frame #34: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b168f288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #59: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f60fa01f610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank30]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank54]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank40]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b773be18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #60: PyObject_Call + 0xbc (0x55b168f43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1834a7b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank34]: pipeline_state.run_communication() [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank45]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank62]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b168f2a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #16: + 0x1acf4b8 (0x7f5c6faa74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #16: + 0x1acf4b8 (0x7f1df7a9f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank35]: recv_activation_tensor = recv_activation() [default5]:[rank45]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank44]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default6]:[rank62]: frame #62: + 0x150582 (0x55b168f43582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #63: PyObject_Call + 0xbc (0x55b168f43f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank21]: frame #17: + 0x5aee004 (0x7f1dfbabe004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank54]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank32]: recv_activation_tensor = recv_activation() [default4]:[rank44]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank58]: frame #17: + 0x5aee004 (0x7f5c73ac6004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #18: + 0x5af36b5 (0x7f5c73acb6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #19: + 0xd2631e (0x7f5c866b531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #20: + 0x47def4 (0x7f5c85e0cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #21: + 0x1445a6 (0x55de096155a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #18: + 0x5af36b5 (0x7f1dfbac36b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f17fc288189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank30]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank42]: dist.recv( [default2]:[rank58]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55de0960ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #23: + 0x150866 (0x55de09621866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55de0960a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55de09615a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #26: PyObject_Call + 0xbc (0x55de09621f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f17fc28f610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank29]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank54]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank35]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55de096082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55de09615a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55de096068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #30: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55de096068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #19: + 0xd2631e (0x7f1e0e6ad31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank54]: dist.recv( [default3]:[rank51]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank34]: recv_activation_tensor = recv_activation() [default0]:[rank40]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b773be8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #32: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55de096068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #34: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55de096068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f60fa03e978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank30]: dist.recv( [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank44]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7f9d545897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank58]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55de0960df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55de0961fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #38: + 0x211239 (0x55de096e2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f17fc2ae978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank30]: return func(*args, **kwargs) [default5]:[rank29]: pipeline_state.run_communication() [default6]:[rank54]: return func(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank37]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank44]: frame #1: + 0x5b3a23e (0x7f7fd706223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: return forward_call(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55de0960ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55de0960a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55de09615a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55de09605c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #20: + 0x47def4 (0x7f1e0de04ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank30]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank54]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank40]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b773bfac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55de09615a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55de096068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #45: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #46: PyObject_Call + 0xbc (0x55de09621f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #21: + 0x1445a6 (0x55b9930405a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #12: + 0x5adc309 (0x7f1834a6d309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank54]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank44]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f7fd705cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55de096082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #48: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #49: PyObject_Call + 0xbc (0x55de09621f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55de096082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #13: + 0x5ae6f10 (0x7f1834a77f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank32]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank37]: pipeline_state.run_communication() [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank58]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55de09615a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55de0960e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55de0961fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b993039a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank30]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7bc2b2a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank51]: pipeline_state.run_communication() [default6]:[rank54]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank34]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: frame #54: + 0x211239 (0x55de096e2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #55: PyObject_Call + 0x207 (0x55de09622067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55de096082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #23: + 0x150866 (0x55b99304c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #1: + 0x5b3a23e (0x7f7bfc64723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e94a51897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank42]: return func(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank58]: frame #57: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55de096068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #59: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #60: PyObject_Call + 0xbc (0x55de09621f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: Traceback (most recent call last): [default6]:[rank30]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f7bfc641c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #1: + 0x5b3a23e (0x7f0ece56e23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0ece568c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank44]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7fd705cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55de096082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #62: + 0x150582 (0x55de09621582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #63: PyObject_Call + 0xbc (0x55de09621f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank29]: recv_activation_tensor = recv_activation() [default6]:[rank54]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0ece568f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0ece569fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank40]: frame #38: + 0x211239 (0x55b773cbd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank19]: trainer.train(dataloader) [default6]:[rank30]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7bfc641f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0ece51e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank32]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank40]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b773be9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: sharded_logits = self.model( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: frame #14: + 0x5ae6fa5 (0x7f1834a77fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #12: + 0x5adc309 (0x7f61327fd309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank54]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0ece51e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0ece51e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0ece51e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank45]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank30]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7bfc642fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank37]: recv_activation_tensor = recv_activation() [default0]:[rank40]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b773be53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b993035142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank51]: recv_activation_tensor = recv_activation() [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank18]: frame #13: + 0x5ae6f10 (0x7f6132807f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bfc5f7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bfc5f7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0e95d2b189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7fd705dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: Traceback (most recent call last): [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank23]: trainer.train(dataloader) [default6]:[rank22]: frame #15: + 0x5124446 (0x7f18340b5446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #16: + 0x1acf4b8 (0x7f1830a604b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bfc5f7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bfc5f7371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank35]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank40]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b773bf0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #17: + 0x5aee004 (0x7f1834a7f004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank18]: frame #14: + 0x5ae6fa5 (0x7f6132807fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank19]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank30]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7bc3e04189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank54]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0e95d32610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: frame #15: + 0x5124446 (0x7f6131e45446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #18: + 0x5af36b5 (0x7f1834a846b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank51]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank51]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank34]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7fd7012371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7fd7012371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank18]: frame #16: + 0x1acf4b8 (0x7f612e7f04b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7bc3e0b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank30]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f7bc3e2a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank54]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0e95d51978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: dist.recv( [default2]:[rank42]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank45]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank21]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b993040a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #26: PyObject_Call + 0xbc (0x55b99304cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #12: + 0x5adc309 (0x7f7bfc5e9309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: dist.recv( [default5]:[rank37]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank18]: frame #17: + 0x5aee004 (0x7f613280f004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #18: + 0x5af36b5 (0x7f61328146b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank54]: frame #12: + 0x5adc309 (0x7f0ece510309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b773be0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9930332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank30]: frame #13: + 0x5ae6f10 (0x7f7bfc5f3f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank35]: dist.recv( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank32]: return func(*args, **kwargs) [default4]:[rank44]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7fd7012371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank19]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank29]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank54]: frame #13: + 0x5ae6f10 (0x7f0ece51af10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank37]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank34]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank41]: return forward_call(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank22]: frame #19: + 0xd2631e (0x7f184766e31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank29]: dist.recv( [default3]:[rank51]: return func(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank35]: return func(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank44]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7fd7012371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7f9e81f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank23]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank18]: frame #19: + 0xd2631e (0x7f61453fe31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank30]: frame #14: + 0x5ae6fa5 (0x7f7bfc5f3fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank51]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank44]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7f9e826610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank18]: frame #20: + 0x47def4 (0x7f6144b55ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank29]: return func(*args, **kwargs) [default3]:[rank51]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank34]: dist.recv( [default0]:[rank40]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b773bf0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default2]:[rank18]: frame #21: + 0x1445a6 (0x5599a56255a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank22]: frame #20: + 0x47def4 (0x7f1846dc5ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank21]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b993040a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank29]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank54]: frame #14: + 0x5ae6fa5 (0x7f0ece51afa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank42]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank29]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank51]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1068fe7897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank34]: return func(*args, **kwargs) [default0]:[rank40]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b773be18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #45: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank29]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank29]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76f04cb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank54]: frame #15: + 0x5124446 (0x7f0ecdb58446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank44]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f7f9e845978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank21]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b9930318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #15: + 0x5124446 (0x7f7bfbc31446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #16: + 0x1acf4b8 (0x7f0eca5034b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #1: + 0x5b3a23e (0x7f10a2b0423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank45]: pipeline_state.run_communication() [default1]:[rank41]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank23]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank18]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5599a561ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #16: + 0x1acf4b8 (0x7f7bf85dc4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f10a2afec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank30]: frame #17: + 0x5aee004 (0x7f7bfc5fb004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: frame #18: + 0x5af36b5 (0x7f7bfc6006b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f10a2afef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #17: + 0x5aee004 (0x7f0ece522004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank37]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank32]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank40]: frame #46: PyObject_Call + 0xbc (0x55b773bfcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: output = model(**micro_batch) [default6]:[rank30]: frame #19: + 0xd2631e (0x7f7c0f1ea31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank51]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f10a2afffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default0]:[rank40]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b773be32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #30: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #1: + 0x5b3a23e (0x7f7729fe823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #18: + 0x5af36b5 (0x7f0ece5276b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank37]: dist.recv( [default5]:[rank45]: recv_activation_tensor = recv_activation() [default1]:[rank41]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank40]: frame #48: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #23: + 0x150866 (0x5599a5631866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #20: + 0x47def4 (0x7f7c0e941ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank51]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f10a2ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank44]: frame #12: + 0x5adc309 (0x7f7fd7004309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: frame #21: + 0x1445a6 (0x561fd02b45a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561fd02ada6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #19: + 0xd2631e (0x7f0ee111131e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank32]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f7729fe2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #20: + 0x47def4 (0x7f0ee0868ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank32]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank42]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6398f37897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank18]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5599a561a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5599a5625a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #23: + 0x150866 (0x561fd02c0866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #21: + 0x1445a6 (0x563b114f15a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #22: _PyObject_MakeTpCall + 0x26b (0x563b114eaa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank40]: frame #49: PyObject_Call + 0xbc (0x55b773bfcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b773be32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #21: + 0x1445a6 (0x5651896a65a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7729fe2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #23: + 0x150866 (0x563b114fd866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default0]:[rank40]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b773bf0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank45]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank42]: frame #1: + 0x5b3a23e (0x7f63d2a5423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561fd02a9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561fd02b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f10a2ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c87363897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank44]: frame #13: + 0x5ae6f10 (0x7f7fd700ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b9930318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #26: PyObject_Call + 0xbc (0x561fd02c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f10a2ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563b114e6142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank18]: frame #26: PyObject_Call + 0xbc (0x5599a5631f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a56182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561fd02a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561fd02b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561fd02a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f10a2ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc3ca0c6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank42]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f63d2a4ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: return forward_call(*args, **kwargs) [default6]:[rank30]: frame #30: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561fd02a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f106a2c1189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: frame #1: + 0x5b3a23e (0x7f0cc0e8023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank21]: frame #32: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #32: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561fd02a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f106a2c8610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank37]: return func(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b773be9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f63d2a4ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b9930318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #34: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #25: _PyFunction_Vectorcall + 0x6c (0x563b114f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #1: + 0x5b3a23e (0x7fc403be323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f63d2a4ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank40]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b773bfac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5599a5625a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5599a56168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561fd02a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561fd02acf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #26: PyObject_Call + 0xbc (0x563b114fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0cc0e7ac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank29]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7729fe3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7729f98371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563b114e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f106a2e7978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0cc0e7af82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0515450897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank40]: frame #54: + 0x211239 (0x55b773cbd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank29]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7729f98371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #12: + 0x5adc309 (0x7f10a2aa6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #13: + 0x5ae6f10 (0x7f10a2ab0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #1: + 0x5b3a23e (0x7f054ef6d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: frame #55: PyObject_Call + 0x207 (0x55b773bfd067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: output = model(**micro_batch) [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561fd02bec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #28: _PyFunction_Vectorcall + 0x6c (0x563b114f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fc403bddc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank45]: dist.recv( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56518969fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #30: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7729f98371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7729f98371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563b114e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0cc0e7bfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: return forward_call(*args, **kwargs) [default0]:[rank40]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b773be32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #34: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b9930318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f76f17a5189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #14: + 0x5ae6fa5 (0x7f10a2ab0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0cc0e30371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63d2a04371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b993038f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f76f17ac610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank29]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f76f17cb978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: frame #30: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc403bddf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63d2a04371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank30]: frame #38: + 0x211239 (0x561fd0381239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563b114e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #32: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default0]:[rank32]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0cc0e30371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63d2a04371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5599a56168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #12: + 0x5adc309 (0x7f7729f8a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563b114e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #34: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc403bdefd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc403b93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank18]: frame #32: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b99304ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #13: + 0x5ae6f10 (0x7f7729f94f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank30]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561fd02ada6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #15: + 0x5124446 (0x7f10a20ee446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0cc0e30371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #14: + 0x5ae6fa5 (0x7f7fd700efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #15: + 0x5124446 (0x7f7fd664c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: pipeline_state.run_communication() [default6]:[rank30]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561fd02a93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561fd02b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563b114e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank42]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63d2a04371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #16: + 0x1acf4b8 (0x7f7fd2ff74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank30]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561fd02a4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563b114e9f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0cc0e30371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0c8863d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: return func(*args, **kwargs) [default0]:[rank40]: frame #57: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b773be18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #38: + 0x211239 (0x55b99310d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561fd02b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #37: _PyObject_Call_Prepend + 0x69 (0x563b114fbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #38: + 0x211239 (0x563b115be239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #39: _PyObject_MakeTpCall + 0x26b (0x563b114eaa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc403b93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank18]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5599a56168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561fd02a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #45: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563b114e63e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa35af10897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank18]: frame #34: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #46: PyObject_Call + 0xbc (0x561fd02c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #14: + 0x5ae6fa5 (0x7f7729f94fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #41: _PyFunction_Vectorcall + 0x6c (0x563b114f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563b114e1c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #43: _PyFunction_Vectorcall + 0x6c (0x563b114f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0c88644610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #59: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank30]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561fd02a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #16: + 0x1acf4b8 (0x7f109ea994b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0c88663978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #17: + 0x5aee004 (0x7f7fd7016004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #18: + 0x5af36b5 (0x7f7fd701b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #23: + 0x150866 (0x5651896b2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #48: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #49: PyObject_Call + 0xbc (0x561fd02c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563b114e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #45: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f054ef67c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: pipeline_state.run_communication() [default6]:[rank22]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56518969b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561fd02a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #46: PyObject_Call + 0xbc (0x563b114fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563b114e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #48: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #49: PyObject_Call + 0xbc (0x563b114fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #1: + 0x5b3a23e (0x7fa394a2d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa394a27c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank40]: frame #60: PyObject_Call + 0xbc (0x55b773bfcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: frame #15: + 0x5124446 (0x7f77295d2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563b114e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #12: + 0x5adc309 (0x7f0cc0e22309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #13: + 0x5ae6f10 (0x7f0cc0e2cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b773be32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b993039a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561fd02b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561fd02ad007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #51: _PyFunction_Vectorcall + 0x6c (0x563b114f1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563b114ea007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f054ef67f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank21]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b9930353e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561fd02bec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #53: _PyObject_Call_Prepend + 0x69 (0x563b114fbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f054ef68fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc403b93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #14: + 0x5ae6fa5 (0x7f0cc0e2cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f639a211189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank21]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b993040a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: frame #54: + 0x211239 (0x561fd0381239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #55: PyObject_Call + 0x207 (0x561fd02c1067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561fd02a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #17: + 0x5aee004 (0x7f10a2ab8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #18: + 0x5af36b5 (0x7f10a2abd6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f054ef1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa394a27f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #19: + 0xd2631e (0x7f7fe9c0531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank44]: frame #20: + 0x47def4 (0x7f7fe935cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: frame #16: + 0x1acf4b8 (0x7f7725f7d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank29]: frame #17: + 0x5aee004 (0x7f7729f9c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #19: + 0xd2631e (0x7f10b56a731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank51]: frame #20: + 0x47def4 (0x7f10b4dfeef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank37]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa394a28fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #62: + 0x150582 (0x55b773bfc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #63: PyObject_Call + 0xbc (0x55b773bfcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5599a56168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5599a561df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #57: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561fd02a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #54: + 0x211239 (0x563b115be239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #15: + 0x5124446 (0x7f0cc046a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank18]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5599a562fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #59: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #21: + 0x1445a6 (0x562a6085d5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa3949dd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default0]:[rank40]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank44]: frame #21: + 0x1445a6 (0x5632243625a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56322435ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5651896a6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #26: PyObject_Call + 0xbc (0x5651896b2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #18: + 0x5af36b5 (0x7f7729fa16b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #55: PyObject_Call + 0x207 (0x563b114fe067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f054ef1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f054ef1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f639a218610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank42]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f639a237978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank21]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b993030c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b993040a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #60: PyObject_Call + 0xbc (0x561fd02c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561fd02a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #62: + 0x150582 (0x561fd02c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank30]: frame #63: PyObject_Call + 0xbc (0x561fd02c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563b114e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #22: _PyObject_MakeTpCall + 0x26b (0x562a60856a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f054ef1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa3949dd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa3949dd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: recv_activation_tensor = recv_activation() [default7]:[rank23]: return forward_call(*args, **kwargs) [default6]:[rank30]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank54]: frame #57: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #16: + 0x1acf4b8 (0x7f0cbce154b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #17: + 0x5aee004 (0x7f0cc0e34004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc403b93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank18]: frame #38: + 0x211239 (0x5599a56f2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5599a561ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #19: + 0xd2631e (0x7f773cb8b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank29]: frame #20: + 0x47def4 (0x7f773c2e2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank29]: frame #21: + 0x1445a6 (0x5650605be5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563b114e28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa3949dd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #12: + 0x5adc309 (0x7f63d29f6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: recv_activation_tensor = recv_activation() [default5]:[rank29]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5650605b7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #23: + 0x150866 (0x5650605ca866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5650605b3142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #23: + 0x150866 (0x562a60869866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x562a60852142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #18: + 0x5af36b5 (0x7f0cc0e396b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #23: + 0x150866 (0x56322436e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563224357142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank29]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5650605bea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #26: PyObject_Call + 0xbc (0x5650605caf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5650605b12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #25: _PyFunction_Vectorcall + 0x6c (0x562a6085da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #26: PyObject_Call + 0xbc (0x562a60869f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #19: + 0xd2631e (0x7f0cd3a2331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f051672a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0516731610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #25: _PyFunction_Vectorcall + 0x6c (0x563224362a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #13: + 0x5ae6f10 (0x7f63d2a00f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b9930318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #45: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5650605bea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5650605af8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #30: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5650605af8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x562a608502b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa35c1ea189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank29]: frame #32: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5650605af8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #34: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5650605af8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #28: _PyFunction_Vectorcall + 0x6c (0x562a6085da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #59: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa35c1f1610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #26: PyObject_Call + 0xbc (0x56322436ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank23]: sharded_logits = self.model( [default5]:[rank29]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5650605b6f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5650605c8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #38: + 0x211239 (0x56506068b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5650605b7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x562a6084e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #60: PyObject_Call + 0xbc (0x563b114fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563b114e42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc3cb3a0189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank34]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fc3cb3a7610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank41]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5650605b33e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5650605bea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5650605aec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #30: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #20: + 0x47def4 (0x7f0cd317aef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank44]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5632243552b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #28: _PyFunction_Vectorcall + 0x6c (0x563224362a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5599a561a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5599a5625a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5650605bea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5650605af8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #45: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #46: PyObject_Call + 0xbc (0x5650605caf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #62: + 0x150582 (0x563b114fd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #63: PyObject_Call + 0xbc (0x563b114fdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fc3cb3c6978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5632243538fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fada0ac0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank41]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank19]: sharded_logits = self.model( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5650605b12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #48: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #49: PyObject_Call + 0xbc (0x5650605caf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5650605b12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank37]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa35c210978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #30: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank42]: frame #14: + 0x5ae6fa5 (0x7f63d2a00fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank29]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5650605bea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5650605b7007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5650605c8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x562a6084e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #32: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x562a6084e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #21: + 0x1445a6 (0x55d1f2c5e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5632243538fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #1: + 0x5b3a23e (0x7fadda5dd23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: frame #54: + 0x211239 (0x56506068b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #55: PyObject_Call + 0x207 (0x5650605cb067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5650605b12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #57: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #34: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x562a6084e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x562a60855f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55d1f2c57a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #12: + 0x5adc309 (0x7fc403b85309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0516750978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fadda5d7c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5650605af8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #59: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #60: PyObject_Call + 0xbc (0x5650605caf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5650605b12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #37: _PyObject_Call_Prepend + 0x69 (0x562a60867c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #38: + 0x211239 (0x562a6092a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #39: _PyObject_MakeTpCall + 0x26b (0x562a60856a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x562a608523e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #12: + 0x5adc309 (0x7f054ef0f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #32: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5651896992b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #62: + 0x150582 (0x5650605ca582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: frame #63: PyObject_Call + 0xbc (0x5650605caf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank29]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank51]: frame #41: _PyFunction_Vectorcall + 0x6c (0x562a6085da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x562a6084dc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #43: _PyFunction_Vectorcall + 0x6c (0x562a6085da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x562a6084e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #23: + 0x150866 (0x55d1f2c6a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fadda5d7f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5651896a6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #45: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #46: PyObject_Call + 0xbc (0x562a60869f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x562a608502b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #48: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #49: PyObject_Call + 0xbc (0x562a60869f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #13: + 0x5ae6f10 (0x7f054ef19f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #12: + 0x5adc309 (0x7fa3949cf309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #13: + 0x5ae6f10 (0x7fa3949d9f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5632243538fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #15: + 0x5124446 (0x7f63d203e446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5651896978fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5599a5615c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x562a608502b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #51: _PyFunction_Vectorcall + 0x6c (0x562a6085da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x562a60856007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #13: + 0x5ae6f10 (0x7fc403b8ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #16: + 0x1acf4b8 (0x7f63ce9e94b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: frame #53: _PyObject_Call_Prepend + 0x69 (0x562a60867c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #54: + 0x211239 (0x562a6092a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #55: PyObject_Call + 0x207 (0x562a6086a067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x562a608502b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d1f2c53142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #34: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank51]: frame #57: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x562a6084e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #59: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #60: PyObject_Call + 0xbc (0x562a60869f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x562a608502b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #62: + 0x150582 (0x562a60869582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #63: PyObject_Call + 0xbc (0x562a60869f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/pyt[default0]:[rank32]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55d1f2c5ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank42]: frame #17: + 0x5aee004 (0x7f63d2a08004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default5]:[rank21]: frame #46: PyObject_Call + 0xbc (0x55b99304cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) hon3.10) [default3]:[rank51]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank34]: frame #14: + 0x5ae6fa5 (0x7fc403b8ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5632243538fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fadda5d8fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #30: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: Traceback (most recent call last): [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank49]: trainer.train(dataloader) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank49]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank49]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank49]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default2]:[rank34]: frame #15: + 0x5124446 (0x7fc4031cd446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadda58d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadda58d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank23]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank49]: output = model(**micro_batch) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank49]: sharded_logits = self.model( [default3]:[rank35]: frame #14: + 0x5ae6fa5 (0x7f054ef19fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56322435af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #18: + 0x5af36b5 (0x7f63d2a0d6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank49]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank49]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-[default3]:[rank35]: frame #15: + 0x5124446 (0x7f054e557446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #16: + 0x1acf4b8 (0x7fc3ffb784b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #14: + 0x5ae6fa5 (0x7fa3949d9fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #15: + 0x5124446 (0x7fa394017446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadda58d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: meta = self._recv_meta(from_rank=from_rank, tag=tag) cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank49]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank49]: pipeline_state.run_communication() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communi[default2]:[rank34]: frame #17: + 0x5aee004 (0x7fc403b97004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #19: + 0xd2631e (0x7f63e55f731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank22]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5651896978fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #32: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) cation [default1]:[rank49]: recv_activation_tensor = recv_activation() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank49]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank49]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank49]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank49]: di[default5]:[rank37]: frame #16: + 0x1acf4b8 (0x7fa3909c24b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #16: + 0x1acf4b8 (0x7f054af024b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadda58d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5599a5625a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) st.recv( [default0]:[rank32]: frame #26: PyObject_Call + 0xbc (0x55d1f2c6af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fada1d9a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56322436cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5599a56168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank49]: return func(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank49]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default1]:[rank49]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank49]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7faa369a9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank32]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d1f2c512b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #38: + 0x211239 (0x56322442f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #20: + 0x47def4 (0x7f63e4d4eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank21]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9930332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #1: + 0x5b3a23e (0x7faa704c623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7faa704c0c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7faa704c0f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #17: + 0x5aee004 (0x7f054ef21004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #21: + 0x1445a6 (0x5580c86735a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #48: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank22]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5651896978fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #34: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7faa704c1fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa70476371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa70476371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa70476371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa70476371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libt[default3]:[rank35]: frame #18: + 0x5af36b5 (0x7f054ef266b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fada1da1610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank21]: frame #49: PyObject_Call + 0xbc (0x55b99304cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) orch_cpu.so) [default2]:[rank34]: frame #18: + 0x5af36b5 (0x7fc403b9c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56322435ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5632243573e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank16]: dist.recv( [default1]:[rank49]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7faa37c83189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7faa37c8a610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7faa37ca9978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: frame #12: + 0x5adc309 (0x7faa70468309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #13: [default2]:[rank34]: frame #19: + 0xd2631e (0x7fc41678631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank45]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fada1dc0978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank23]: return self._call_impl(*args, **kwargs) + 0x5ae6f10 (0x7faa70472f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #17: + 0x5aee004 (0x7fa3949e1004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #18: + 0x5af36b5 (0x7fa3949e66b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #19: + 0xd2631e (0x7fa3a75d031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank45]: frame #12: + 0x5adc309 (0x7fadda57f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #45: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #14: + 0x5ae6fa5 (0x7faa70472fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #15: + 0x5124446 (0x7faa6fab0446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #16: + 0x1acf4b8 (0x7faa6c45b4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #17: + 0x5aee004 (0x7faa7047a004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #20: + 0x47def4 (0x7fc415eddef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #19: + 0xd2631e (0x7f0561b1031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #20: + 0x47def4 (0x7f0561267ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank45]: frame #13: + 0x5ae6f10 (0x7fadda589f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #41: _PyFunction_Vectorcall + 0x6c (0x563224362a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank49]: frame #18: + 0x5af36b5 (0x7faa7047f6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #19: + 0xd2631e (0x7faa8306931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank49]: frame #20: + 0x47def4 (0x7faa827c0ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank49]: frame #21: + 0x1445a6 (0x559ec0b525a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55d1f2c5ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d1f2c4f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563224352c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default1]:[rank49]: frame #22: _PyObject_MakeTpCall + 0x26b (0x559ec0b4ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #23: + 0x150866 (0x559ec0b5e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559ec0b47142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #25: _PyFunction_Vectorcall + 0x6c (0x559ec0b52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #26: PyObject_Call + 0xbc (0x559ec0b5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #21: + 0x1445a6 (0x55e5c85f25a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55e5c85eba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: dist.recv( [default2]:[rank18]: frame #46: PyObject_Call + 0xbc (0x5599a5631f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559ec0b452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #28: _PyFunction_Vectorcall + 0x6c (0x559ec0b52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559ec0b438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #30: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #23: + 0x150866 (0x55e5c85fe866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #30: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d1f2c4f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #21: + 0x1445a6 (0x5599a4d8f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank18]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a56182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9930332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559ec0b438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #32: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559ec0b438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #34: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e5c85e7142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #20: + 0x47def4 (0x7fa3a6d27ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank32]: frame #32: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: return func(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank49]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x559ec0b438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559ec0b4af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #37: _PyObject_Call_Prepend + 0x69 (0x559ec0b5cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #38: + 0x211239 (0x559ec0c1f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55e5c85f2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #26: PyObject_Call + 0xbc (0x55e5c85fef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #43: _PyFunction_Vectorcall + 0x6c (0x563224362a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank49]: frame #39: _PyObject_MakeTpCall + 0x26b (0x559ec0b4ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559ec0b473e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #41: _PyFunction_Vectorcall + 0x6c (0x559ec0b52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e5c85e52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5599a4d88a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5632243538fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5651896978fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56518969ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559ec0b42c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #43: _PyFunction_Vectorcall + 0x6c (0x559ec0b52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559ec0b438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #45: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #23: + 0x150866 (0x5599a4d9b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5580c866ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b993040a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #46: PyObject_Call + 0xbc (0x559ec0b5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559ec0b452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #48: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #49: PyObject_Call + 0xbc (0x559ec0b5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55e5c85f2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d1f2c4f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #21: + 0x1445a6 (0x55d5815b45a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #23: + 0x150866 (0x5580c867f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank49]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559ec0b452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #51: _PyFunction_Vectorcall + 0x6c (0x559ec0b52a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559ec0b4b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #53: _PyObject_Call_Prepend + 0x69 (0x559ec0b5cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e5c85e38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #30: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #45: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #48: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5651896b0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #38: + 0x211239 (0x565189773239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #54: + 0x211239 (0x559ec0c1f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #55: PyObject_Call + 0x207 (0x559ec0b5f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559ec0b452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #57: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e5c85e38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #46: PyObject_Call + 0xbc (0x56322436ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: return func(*args, **kwargs) [default2]:[rank18]: frame #49: PyObject_Call + 0xbc (0x5599a5631f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559ec0b438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #59: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #60: PyObject_Call + 0xbc (0x559ec0b5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559ec0b452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55d5815ada6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #23: + 0x150866 (0x55d5815c0866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5580c8668142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a56182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: pipeline_state.run_communication() [default1]:[rank49]: frame #62: + 0x150582 (0x559ec0b5e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #63: PyObject_Call + 0xbc (0x559ec0b5ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank35]: frame #32: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5580c8673a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56518969fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56518969b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: Traceback (most recent call last): [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5599a4d84142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5599a4d8fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #34: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5632243552b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #48: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5651896a6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: trainer.train(dataloader) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d5815a9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55d5815b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #26: PyObject_Call + 0xbc (0x5580c867ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b993039007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b99304ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank52]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank52]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank32]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d1f2c4f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5580c86662b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank52]: output = model(**micro_batch) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank32]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d1f2c56f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #49: PyObject_Call + 0xbc (0x56322436ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: recv_activation_tensor = recv_activation() [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank52]: sharded_logits = self.model( [default3]:[rank35]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e5c85e38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #34: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5632243552b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank16]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default0]:[rank32]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55d1f2c68c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #38: + 0x211239 (0x55d1f2d2b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #26: PyObject_Call + 0xbc (0x55d5815c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #14: + 0x5ae6fa5 (0x7fadda589fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5599a5625a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5599a561e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank52]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank34]: frame #26: PyObject_Call + 0xbc (0x5599a4d9bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #15: + 0x5124446 (0x7fadd9bc7446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #54: + 0x211239 (0x55b99310d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a4d822b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e5c85e38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e5c85eaf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #51: _PyFunction_Vectorcall + 0x6c (0x563224362a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank37]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d5815a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55d5815b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56322435b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x565189696c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank52]: pipeline_state.run_communication() [default3]:[rank35]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55e5c85fcc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5580c8673a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank23]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank52]: recv_activation_tensor = recv_activation() [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank32]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55d1f2c57a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5580c86648fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank18]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5599a562fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank52]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank32]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d1f2c533e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #38: + 0x211239 (0x55e5c86bf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55e5c85eba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank18]: frame #54: + 0x211239 (0x5599a56f2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank35]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e5c85e73e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55d1f2c5ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d5815a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: dist.recv( [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank52]: return func(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank34]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5599a4d8fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #16: + 0x1acf4b8 (0x7fadd65724b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5651896a6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank52]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default3]:[rank35]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55e5c85f2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5599a4d808fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #30: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5580c86648fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank52]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f759e8b7897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank34]: frame #30: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5599a4d808fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank18]: frame #55: PyObject_Call + 0x207 (0x5599a5632067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a56182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #1: + 0x5b3a23e (0x7f75d83d423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f75d83cec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e5c85e2c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55e5c85f2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #17: + 0x5aee004 (0x7fadda591004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank23]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank52]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f75d83cef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f75d83cffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f75d8384371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #32: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #32: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #55: PyObject_Call + 0x207 (0x55b99304d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #57: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5599a56168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f75d8384371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f75d8384371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f75d8384371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5599a4d808fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #34: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2bc23b2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank18]: frame #59: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f759fb91189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f759fb98610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f759fbb7978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank34]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5599a4d808fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #18: + 0x5af36b5 (0x7fadda5966b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: frame #12: + 0x5adc309 (0x7f75d8376309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #13: + 0x5ae6f10 (0x7f75d8380f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d1f2c4ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5580c86648fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank16]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb28a79b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank52]: frame #14: + 0x5ae6fa5 (0x7f75d8380fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #15: + 0x5124446 (0x7f75d79be446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #16: + 0x1acf4b8 (0x7f75d43694b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #17: + 0x5aee004 (0x7f75d8388004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55d1f2c5ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e5c85e38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #1: + 0x5b3a23e (0x7f2bfbecf23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5651896978fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #18: + 0x5af36b5 (0x7f75d838d6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #19: + 0xd2631e (0x7f75eaf7731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank52]: frame #20: + 0x47def4 (0x7f75ea6ceef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #45: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f2bfbec9c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56322436cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #54: + 0x211239 (0x56322442f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: return forward_call(*args, **kwargs) [default4]:[rank52]: frame #21: + 0x1445a6 (0x5600f668b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5600f6684a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #23: + 0x150866 (0x5600f6697866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5600f6680142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5600f668ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #26: PyObject_Call + 0xbc (0x5600f6697f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5600f667e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster[default2]:[rank34]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5599a4d87f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5599a4d99c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #55: PyObject_Call + 0x207 (0x56322436f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #60: PyObject_Call + 0xbc (0x5599a5631f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #1: + 0x5b3a23e (0x7fb2c42b823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fb2c42b2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) /bin/python3.10) [default3]:[rank35]: frame #46: PyObject_Call + 0xbc (0x55e5c85fef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e5c85e52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #19: + 0xd2631e (0x7faded18031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank52]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5600f668ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5600f667c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #30: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5600f667c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #32: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #48: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #30: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d5815a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #20: + 0x47def4 (0x7fadec8d7ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank19]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank52]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5600f667c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #34: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5600f667c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5600f6683f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #49: PyObject_Call + 0xbc (0x55e5c85fef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5632243552b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #57: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #45: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5600f6695c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #38: + 0x211239 (0x5600f6758239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5600f6684a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5600f66803e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5600f668ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d1f2c4f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2bfbec9f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a56182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5600f667bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5600f668ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5600f667c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #45: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #45: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5632243538fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #59: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank16]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb2c42b2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #46: PyObject_Call + 0xbc (0x5600f6697f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5600f667e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #48: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #49: PyObject_Call + 0xbc (0x5600f6697f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #38: + 0x211239 (0x5599a4e5c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e5c85e52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #60: PyObject_Call + 0xbc (0x56322436ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2bfbecafd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank23]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank52]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5600f667e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5600f668ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5600f6684007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55e5c85f2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #32: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5632243552b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #62: + 0x150582 (0x56322436e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #46: PyObject_Call + 0xbc (0x5651896b2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5600f6695c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #54: + 0x211239 (0x5600f6758239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #55: PyObject_Call + 0x207 (0x5600f6698067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5600f667e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e5c85eb007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2bfbe7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5651896992b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: frame #62: + 0x150582 (0x5599a5631582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #57: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5600f667c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #59: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #60: PyObject_Call + 0xbc (0x5600f6697f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5599a4d88a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2bfbe7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank18]: frame #63: PyObject_Call + 0xbc (0x5599a5631f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank52]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5600f667e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #62: + 0x150582 (0x5600f6697582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #63: PyObject_Call + 0xbc (0x5600f6697f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank32]: frame #46: PyObject_Call + 0xbc (0x55d1f2c6af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d5815a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #34: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #63: PyObject_Call + 0xbc (0x56322436ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: dist.recv( [default2]:[rank34]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5599a4d843e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #21: + 0x1445a6 (0x55bf533655a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank18]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank37]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d5815a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55bf5335ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2bfbe7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb2c42b3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d5815acf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55d5815bec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #34: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2c4268371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #38: + 0x211239 (0x55d581681239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5580c86648fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank16]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2c4268371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d1f2c512b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2bfbe7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2bc368c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank34]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5599a4d8fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5599a4d7fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5580c866bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2c4268371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb2c4268371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55d5815ada6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f2bc3693610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank19]: pipeline_state.run_communication() [default5]:[rank21]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9930332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55e5c85fcc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #54: + 0x211239 (0x55e5c86bf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #23: + 0x150866 (0x55bf53371866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #48: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #49: PyObject_Call + 0xbc (0x5651896b2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d5815a93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5580c867dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f2bc36b2978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank16]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb28ba75189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank37]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55d5815b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bf5335a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: return func(*args, **kwargs) [default2]:[rank34]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5599a4d8fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #12: + 0x5adc309 (0x7f2bfbe71309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5651896992b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #55: PyObject_Call + 0x207 (0x55e5c85ff067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #38: + 0x211239 (0x5580c8740239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5651896a6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank37]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d5815a4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55bf53365a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #13: + 0x5ae6f10 (0x7f2bfbe7bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank37]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55d5815b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5580c866ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fb28ba7c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e5c85e52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #57: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5580c86683e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fb28ba9b978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank19]: recv_activation_tensor = recv_activation() [default2]:[rank34]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5599a4d808fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #14: + 0x5ae6fa5 (0x7f2bfbe7bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank21]: frame #57: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #45: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5580c8673a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank32]: frame #48: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #49: PyObject_Call + 0xbc (0x55d1f2c6af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d5815a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #26: PyObject_Call + 0xbc (0x55bf53371f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b9930318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #59: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #46: PyObject_Call + 0xbc (0x5599a4d9bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #15: + 0x5124446 (0x7f2bfb4b9446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #12: + 0x5adc309 (0x7fb2c425a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #13: + 0x5ae6f10 (0x7fb2c4264f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a4d822b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5580c8663c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank37]: frame #45: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e5c85e38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #16: + 0x1acf4b8 (0x7f2bf7e644b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank22]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56518969f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #60: PyObject_Call + 0xbc (0x55b99304cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank19]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank35]: frame #59: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5580c8673a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9930332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d1f2c512b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55d1f2c5ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #48: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #17: + 0x5aee004 (0x7f2bfbe83004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank21]: frame #62: + 0x150582 (0x55b99304c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #46: PyObject_Call + 0xbc (0x55d5815c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf533582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5651896b0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d5815a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #60: PyObject_Call + 0xbc (0x55e5c85fef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e5c85e52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55bf53365a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #18: + 0x5af36b5 (0x7f2bfbe886b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5580c86648fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #14: + 0x5ae6fa5 (0x7fb2c4264fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #62: + 0x150582 (0x55e5c85fe582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #19: + 0xd2631e (0x7f2c0ea7231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank23]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank35]: frame #63: PyObject_Call + 0xbc (0x55e5c85fef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #20: + 0x47def4 (0x7f2c0e1c9ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank23]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank22]: frame #54: + 0x211239 (0x565189773239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d1f2c57007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #48: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #49: PyObject_Call + 0xbc (0x55d5815c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d5815a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bf533568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #55: PyObject_Call + 0x207 (0x5651896b3067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #49: PyObject_Call + 0xbc (0x5599a4d9bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #21: + 0x1445a6 (0x55626f3c35a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #45: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa8aa156897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank32]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55d1f2c68c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #54: + 0x211239 (0x55d1f2d2b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #30: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #1: + 0x5b3a23e (0x7fa8e3c7323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a4d822b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55626f3bca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: frame #63: PyObject_Call + 0xbc (0x55b99304cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55d5815b4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #55: PyObject_Call + 0x207 (0x55d1f2c6b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bf533568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank21]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank16]: frame #15: + 0x5124446 (0x7fb2c38a2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5599a4d8fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5599a4d88007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #23: + 0x150866 (0x55626f3cf866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #46: PyObject_Call + 0xbc (0x5580c867ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5651896992b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d5815ad007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #32: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5580c86662b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank34]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5599a4d99c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bf533568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank16]: frame #16: + 0x1acf4b8 (0x7fb2c024d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa8e3c6dc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d1f2c512b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55d5815bec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #48: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #49: PyObject_Call + 0xbc (0x5580c867ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa8e3c6df82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #54: + 0x211239 (0x5599a4e5c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #55: PyObject_Call + 0x207 (0x5599a4d9c067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #34: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #57: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #57: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55626f3b8142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: dist.recv( [default3]:[rank35]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank42]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5580c86662b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bf533568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #17: + 0x5aee004 (0x7fb2c426c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a4d822b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55626f3c3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #18: + 0x5af36b5 (0x7fb2c42716b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa8e3c6efd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #57: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bf5335df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa8e3c23371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d1f2c4f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #54: + 0x211239 (0x55d581681239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #26: PyObject_Call + 0xbc (0x55626f3cff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5651896978fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #55: PyObject_Call + 0x207 (0x55d5815c1067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55bf5336fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5580c8673a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank32]: frame #59: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #38: + 0x211239 (0x55bf53432239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa8e3c23371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa8e3c23371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5599a4d808fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5580c866c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55626f3b62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #59: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #59: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d5815a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55bf5335ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #60: PyObject_Call + 0xbc (0x5651896b2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #60: PyObject_Call + 0xbc (0x55d1f2c6af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5580c867dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa8e3c23371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d1f2c512b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #57: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55626f3c3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa8ab430189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank37]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d5815a58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bf5335a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5651896992b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #60: PyObject_Call + 0xbc (0x5599a4d9bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55bf53365a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55626f3b48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #54: + 0x211239 (0x5580c8740239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: return func(*args, **kwargs) [default0]:[rank32]: frame #62: + 0x150582 (0x55d1f2c6a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #59: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #30: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #19: + 0xd2631e (0x7fb2d6e5b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank34]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5599a4d822b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #62: + 0x150582 (0x5599a4d9b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #63: PyObject_Call + 0xbc (0x55d1f2c6af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #55: PyObject_Call + 0x207 (0x5580c8680067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank37]: frame #60: PyObject_Call + 0xbc (0x55d5815c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55626f3b48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank23]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa8ab437610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank41]: frame #32: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa8ab456978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank34]: frame #63: PyObject_Call + 0xbc (0x5599a4d9bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bf53355c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #20: + 0x47def4 (0x7fb2d65b2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank37]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d5815a72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5580c86662b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #21: + 0x1445a6 (0x55d12692a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank45]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55bf53365a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #12: + 0x5adc309 (0x7fa8e3c15309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #62: + 0x150582 (0x55d5815c0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #63: PyObject_Call + 0xbc (0x55d5815c0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55626f3b48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #13: + 0x5ae6f10 (0x7fa8e3c1ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank45]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bf533568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #34: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #62: + 0x150582 (0x5651896b2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #57: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: frame #63: PyObject_Call + 0xbc (0x5651896b2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55626f3b48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #45: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #14: + 0x5ae6fa5 (0x7fa8e3c1ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55626f3bbf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55626f3cdc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #15: + 0x5124446 (0x7fa8e325d446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #46: PyObject_Call + 0xbc (0x55bf53371f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf533582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank41]: frame #38: + 0x211239 (0x55626f490239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank42]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5580c86648fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #48: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55626f3bca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #16: + 0x1acf4b8 (0x7fa8dfc084b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #49: PyObject_Call + 0xbc (0x55bf53371f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #17: + 0x5aee004 (0x7fa8e3c27004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55626f3b83e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd8bb9e5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank45]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf533582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #1: + 0x5b3a23e (0x7fd8f550223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #59: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55626f3c3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55626f3b3c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank22]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank45]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55bf53365a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #18: + 0x5af36b5 (0x7fa8e3c2c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #60: PyObject_Call + 0xbc (0x5580c867ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #19: + 0xd2631e (0x7fa8f681631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank19]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fd8f54fcc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55d126923a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bf5335e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #20: + 0x47def4 (0x7fa8f5f6def4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank42]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5580c86662b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #21: + 0x1445a6 (0x5556094e35a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55bf5336fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #23: + 0x150866 (0x55d126936866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #62: + 0x150582 (0x5580c867f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5556094dca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #23: + 0x150866 (0x5556094ef866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55626f3c3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #54: + 0x211239 (0x55bf53432239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #55: PyObject_Call + 0x207 (0x55bf53372067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d12691f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #63: PyObject_Call + 0xbc (0x5580c867ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5556094d8142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5556094e3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55626f3b48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd8f54fcf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank19]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd8f54fdfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #45: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55d12692aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf533582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd8f54b2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #46: PyObject_Call + 0xbc (0x55626f3cff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #26: PyObject_Call + 0xbc (0x55d126936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #26: PyObject_Call + 0xbc (0x5556094eff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55626f3b62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #57: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5556094d62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #48: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d12691d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd8f54b2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5556094e3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bf533568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #59: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5556094d48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #60: PyObject_Call + 0xbc (0x55bf53371f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf533582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd8f54b2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #62: + 0x150582 (0x55bf53371582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #49: PyObject_Call + 0xbc (0x55626f3cff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55626f3b62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55626f3c3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd8f54b2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55626f3bc007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55626f3cdc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd8bccbf189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank41]: frame #54: + 0x211239 (0x55626f490239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55d12692aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d12691b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #63: PyObject_Call + 0xbc (0x55bf53371f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fd8bccc6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank19]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fd8bcce5978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank41]: frame #55: PyObject_Call + 0x207 (0x55626f3d0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #12: + 0x5adc309 (0x7fd8f54a4309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #30: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank19]: frame #13: + 0x5ae6f10 (0x7fd8f54aef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55626f3b62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #57: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55626f3b48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d12691b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #59: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #60: PyObject_Call + 0xbc (0x55626f3cff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55626f3b62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #32: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d12691b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #62: + 0x150582 (0x55626f3cf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #63: PyObject_Call + 0xbc (0x55626f3cff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank16]: frame #34: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d12691b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #30: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5556094d48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #32: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #14: + 0x5ae6fa5 (0x7fd8f54aefa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5556094d48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #34: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d126922f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #15: + 0x5124446 (0x7fd8f4aec446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55d126934c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5556094d48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5556094dbf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #38: + 0x211239 (0x55d1269f7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #16: + 0x1acf4b8 (0x7fd8f14974b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55d126923a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #17: + 0x5aee004 (0x7fd8f54b6004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank16]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d12691f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5556094edc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #38: + 0x211239 (0x5556095b0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55d12692aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5556094dca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5556094d83e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #18: + 0x5af36b5 (0x7fd8f54bb6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank23]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5556094e3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5556094d3c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d12691ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #19: + 0xd2631e (0x7fd9080a531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank19]: frame #20: + 0x47def4 (0x7fd9077fcef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank19]: frame #21: + 0x1445a6 (0x5561e41135a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55d12692aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5561e410ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #23: + 0x150866 (0x5561e411f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5556094e3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5556094d48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d12691b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #45: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #45: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5561e4108142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5561e4113a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #46: PyObject_Call + 0xbc (0x5556094eff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #46: PyObject_Call + 0xbc (0x55d126936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5556094d62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #26: PyObject_Call + 0xbc (0x5561e411ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #48: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #49: PyObject_Call + 0xbc (0x5556094eff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d12691d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5556094d62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5561e41062b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5556094e3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #48: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #49: PyObject_Call + 0xbc (0x55d126936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5561e4113a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d12691d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5556094dc007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5556094edc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5561e41048fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55d12692aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #54: + 0x211239 (0x5556095b0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #55: PyObject_Call + 0x207 (0x5556094f0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d126923007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55d126934c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #54: + 0x211239 (0x55d1269f7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #55: PyObject_Call + 0x207 (0x55d126937067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5556094d62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #57: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d12691d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5556094d48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #59: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #30: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #60: PyObject_Call + 0xbc (0x5556094eff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5556094d62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #57: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d12691b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5561e41048fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #62: + 0x150582 (0x5556094ef582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: frame #63: PyObject_Call + 0xbc (0x5556094eff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank23]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank19]: frame #32: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #59: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #60: PyObject_Call + 0xbc (0x55d126936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d12691d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #62: + 0x150582 (0x55d126936582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: frame #63: PyObject_Call + 0xbc (0x55d126936f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5561e41048fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank16]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank19]: frame #34: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5561e41048fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5561e410bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5561e411dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #38: + 0x211239 (0x5561e41e0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5561e410ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5561e41083e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5561e4113a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5561e4103c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5561e4113a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5561e41048fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #45: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #46: PyObject_Call + 0xbc (0x5561e411ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5561e41062b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #48: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #49: PyObject_Call + 0xbc (0x5561e411ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5561e41062b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5561e4113a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5561e410c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5561e411dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #54: + 0x211239 (0x5561e41e0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #55: PyObject_Call + 0x207 (0x5561e4120067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5561e41062b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #57: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5561e41048fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #59: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #60: PyObject_Call + 0xbc (0x5561e411ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5561e41062b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #62: + 0x150582 (0x5561e411f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: frame #63: PyObject_Call + 0xbc (0x5561e411ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank19]: . This may indicate a possible application crash on rank 0 or a network set up issue. W0703 03:18:54.302000 139993479001920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 683396 closing signal SIGTERM W0703 03:18:54.303000 139993479001920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 683398 closing signal SIGTERM W0703 03:18:54.303000 139993479001920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 683399 closing signal SIGTERM W0703 03:18:54.303000 139993479001920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 683400 closing signal SIGTERM E0703 03:18:54.426000 140665481819968 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 887276) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 1 (local_rank: 1) exitcode : 1 (pid: 887277) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 2 (local_rank: 2) exitcode : 1 (pid: 887278) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 3 (local_rank: 3) exitcode : 1 (pid: 887279) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 4 (local_rank: 4) exitcode : 1 (pid: 887280) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 5 (local_rank: 5) exitcode : 1 (pid: 887281) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 6 (local_rank: 6) exitcode : 1 (pid: 887282) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 7 (local_rank: 7) exitcode : 1 (pid: 887283) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_03:18:54 host : ip-26-0-161-103.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 887276) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-161-103: task 1: Exited with exit code 1 E0703 03:18:55.125000 139993479001920 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 683394) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_03:18:54 host : ip-26-0-161-138.ec2.internal rank : 9 (local_rank: 1) exitcode : 1 (pid: 683395) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_03:18:54 host : ip-26-0-161-138.ec2.internal rank : 11 (local_rank: 3) exitcode : 1 (pid: 683397) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_03:18:54 host : ip-26-0-161-138.ec2.internal rank : 15 (local_rank: 7) exitcode : 1 (pid: 683401) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_03:18:54 host : ip-26-0-161-138.ec2.internal rank : 8 (local_rank: 0) exitcode : 1 (pid: 683394) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-161-138: task 2: Exited with exit code 1 W0703 03:18:59.307000 140703858952000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1162957 closing signal SIGTERM W0703 03:18:59.308000 140703858952000 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1162960 closing signal SIGTERM W0703 03:18:59.311000 140661720811328 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3783390 closing signal SIGTERM W0703 03:18:59.311000 140661720811328 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3783392 closing signal SIGTERM W0703 03:18:59.311000 140661720811328 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3783396 closing signal SIGTERM E0703 03:18:59.431000 140672604792640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 21938) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 E0703 03:18:59.433000 139967333427008 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1438906) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 33 (local_rank: 1) exitcode : 1 (pid: 21939) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 34 (local_rank: 2) exitcode : 1 (pid: 21940) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 35 (local_rank: 3) exitcode : 1 (pid: 21941) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 36 (local_rank: 4) exitcode : 1 (pid: 21942) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 37 (local_rank: 5) exitcode : 1 (pid: 21943) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 38 (local_rank: 6) exitcode : 1 (pid: 21944) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 39 (local_rank: 7) exitcode : 1 (pid: 21945) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_03:18:59 host : ip-26-0-166-125.ec2.internal rank : 32 (local_rank: 0) exitcode : 1 (pid: 21938) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 17 (local_rank: 1) exitcode : 1 (pid: 1438907) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 18 (local_rank: 2) exitcode : 1 (pid: 1438908) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 19 (local_rank: 3) exitcode : 1 (pid: 1438909) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 20 (local_rank: 4) exitcode : 1 (pid: 1438910) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 21 (local_rank: 5) exitcode : 1 (pid: 1438911) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 22 (local_rank: 6) exitcode : 1 (pid: 1438912) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 23 (local_rank: 7) exitcode : 1 (pid: 1438913) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_03:18:59 host : ip-26-0-161-153.ec2.internal rank : 16 (local_rank: 0) exitcode : 1 (pid: 1438906) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ E0703 03:18:59.632000 140703858952000 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1162955) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_03:18:59 host : ip-26-0-161-78.ec2.internal rank : 25 (local_rank: 1) exitcode : 1 (pid: 1162956) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_03:18:59 host : ip-26-0-161-78.ec2.internal rank : 27 (local_rank: 3) exitcode : 1 (pid: 1162958) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_03:18:59 host : ip-26-0-161-78.ec2.internal rank : 28 (local_rank: 4) exitcode : 1 (pid: 1162959) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_03:18:59 host : ip-26-0-161-78.ec2.internal rank : 30 (local_rank: 6) exitcode : 1 (pid: 1162961) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_03:18:59 host : ip-26-0-161-78.ec2.internal rank : 31 (local_rank: 7) exitcode : 1 (pid: 1162962) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_03:18:59 host : ip-26-0-161-78.ec2.internal rank : 24 (local_rank: 0) exitcode : 1 (pid: 1162955) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ E0703 03:18:59.732000 140661720811328 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 3783391) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 srun: error: ip-26-0-166-125: task 4: Exited with exit code 1 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_03:18:59 host : ip-26-0-171-102.ec2.internal rank : 43 (local_rank: 3) exitcode : 1 (pid: 3783393) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_03:18:59 host : ip-26-0-171-102.ec2.internal rank : 44 (local_rank: 4) exitcode : 1 (pid: 3783394) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_03:18:59 host : ip-26-0-171-102.ec2.internal rank : 45 (local_rank: 5) exitcode : 1 (pid: 3783395) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_03:18:59 host : ip-26-0-171-102.ec2.internal rank : 47 (local_rank: 7) exitcode : 1 (pid: 3783397) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_03:18:59 host : ip-26-0-171-102.ec2.internal rank : 41 (local_rank: 1) exitcode : 1 (pid: 3783391) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-161-153: task 3: Exited with exit code 1 srun: error: ip-26-0-161-78: task 0: Exited with exit code 1 srun: error: ip-26-0-171-102: task 7: Exited with exit code 1 W0703 03:19:03.976000 140593047774976 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-62.ec2.internal_3912329_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:19:04.268000 139844849026816 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-88.ec2.internal_900772_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:19:04.307000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900845 closing signal SIGTERM W0703 03:19:04.307000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900846 closing signal SIGTERM W0703 03:19:04.308000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900847 closing signal SIGTERM W0703 03:19:04.308000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900848 closing signal SIGTERM W0703 03:19:04.309000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900849 closing signal SIGTERM W0703 03:19:04.310000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900850 closing signal SIGTERM W0703 03:19:04.310000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900851 closing signal SIGTERM W0703 03:19:04.310000 139850509760320 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 900852 closing signal SIGTERM E0703 03:19:04.441000 140598708508480 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3912402) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 03:19:04.448000 140598708508480 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3912329_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:19:04.476000 140598708508480 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3912329_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:19:04.504000 140598708508480 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3912329_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 49 (local_rank: 1) exitcode : 1 (pid: 3912403) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 50 (local_rank: 2) exitcode : 1 (pid: 3912404) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 51 (local_rank: 3) exitcode : 1 (pid: 3912405) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 52 (local_rank: 4) exitcode : 1 (pid: 3912406) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 53 (local_rank: 5) exitcode : 1 (pid: 3912407) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 54 (local_rank: 6) exitcode : 1 (pid: 3912408) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 55 (local_rank: 7) exitcode : 1 (pid: 3912409) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_03:19:04 host : ip-26-0-171-62.ec2.internal rank : 48 (local_rank: 0) exitcode : 1 (pid: 3912402) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-171-62: task 5: Exited with exit code 1 W0703 03:19:09.272000 139844849026816 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-88.ec2.internal_900772_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:19:12.657000 139850509760320 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_900772_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:19:12.668000 139850509760320 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_900772_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-171-88: task 6: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.