========================
START TIME: Wed Jul 3 09:42:04 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0703 09:42:07.708000 140380479002432 torch/distributed/run.py:757]
W0703 09:42:07.708000 140380479002432 torch/distributed/run.py:757] *****************************************
W0703 09:42:07.708000 140380479002432 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:07.708000 140380479002432 torch/distributed/run.py:757] *****************************************
W0703 09:42:09.895000 139841750644544 torch/distributed/run.py:757]
W0703 09:42:09.895000 139841750644544 torch/distributed/run.py:757] *****************************************
W0703 09:42:09.895000 139841750644544 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:09.895000 139841750644544 torch/distributed/run.py:757] *****************************************
W0703 09:42:09.897000 139791769425728 torch/distributed/run.py:757]
W0703 09:42:09.897000 139791769425728 torch/distributed/run.py:757] *****************************************
W0703 09:42:09.897000 139791769425728 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:09.897000 139791769425728 torch/distributed/run.py:757] *****************************************
W0703 09:42:10.212000 140004522362688 torch/distributed/run.py:757]
W0703 09:42:10.212000 140004522362688 torch/distributed/run.py:757] *****************************************
W0703 09:42:10.212000 140004522362688 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:10.212000 140004522362688 torch/distributed/run.py:757] *****************************************
W0703 09:42:10.343000 139794222888768 torch/distributed/run.py:757]
W0703 09:42:10.343000 139794222888768 torch/distributed/run.py:757] *****************************************
W0703 09:42:10.343000 139794222888768 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:10.343000 139794222888768 torch/distributed/run.py:757] *****************************************
W0703 09:42:10.946000 139687632484160 torch/distributed/run.py:757]
W0703 09:42:10.946000 139687632484160 torch/distributed/run.py:757] *****************************************
W0703 09:42:10.946000 139687632484160 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:10.946000 139687632484160 torch/distributed/run.py:757] *****************************************
W0703 09:42:11.176000 140060554409792 torch/distributed/run.py:757]
W0703 09:42:11.176000 140060554409792 torch/distributed/run.py:757] *****************************************
W0703 09:42:11.176000 140060554409792 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:11.176000 140060554409792 torch/distributed/run.py:757] *****************************************
W0703 09:42:11.512000 140540360554304 torch/distributed/run.py:757]
W0703 09:42:11.512000 140540360554304 torch/distributed/run.py:757] *****************************************
W0703 09:42:11.512000 140540360554304 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 09:42:11.512000 140540360554304 torch/distributed/run.py:757] ***************************************** [default0]:07/03/2024 09:42:36 [WARNING|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Vocab Size Padding] Padded vocab (size: 50257) with 15 dummy tokens (new size: 50272) [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Config: [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: run='%date_%jobid', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: seed=42, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: step=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: consumed_train_samples=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: benchmark_csv_path=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: ignore_sanity_checks=True), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: parallelism=ParallelismArgs(dp=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pp=2, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tp=32, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pp_engine=, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tp_mode=, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tp_linear_async_communication=False, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: expert_parallel_size=1), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: eos_token_id=2, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_act='silu', [default0]:07/03/2024 09:42:36 
[INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_size=2048, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: initializer_range=0.02, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: intermediate_size=4096, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: is_llama_config=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: max_position_embeddings=4096, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_attention_heads=32, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_hidden_layers=24, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_key_value_heads=32, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pad_token_id=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pretraining_tp=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rms_norm_eps=1e-05, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_scaling=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_theta=10000.0, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tie_word_embeddings=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: use_cache=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: vocab_size=50272), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: init_method=RandomInit(std=0.025), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: dtype=torch.bfloat16, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: make_vocab_size_divisible_by=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: ddp_bucket_cap_mb=25), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/03/2024 09:42:36 
[INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokenizer_revision=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokenizer_max_length=None), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: checkpoint_interval=100000, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: save_initial_state=False, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: resume_checkpoint_path=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: checkpoints_path_is_shared_file_system=False), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: logging=LoggingArgs(log_level='info', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: log_level_replica='info', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: iteration_step_info_interval=1), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tokens=TokensArgs(sequence_length=4096, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: train_steps=20, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: micro_batch_size=128, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: batch_accumulation_per_replica=8, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: val_check_interval=-1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: limit_val_batches=0, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: limit_test_batches=0), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: adam_beta1=0.9, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: 
adam_beta2=0.95, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: torch_adam_is_fused=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: name='adamW'), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: zero_stage=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: weight_decay=0.01, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: clip_grad=1.0, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: accumulate_grad_in_fp32=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_warmup_steps=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_warmup_style='linear', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_decay_style='linear', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_decay_steps=19, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lr_decay_starting_step=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: min_decay_lr=1e-05)), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: start_training_step=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hf_dataset_splits='train', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hf_dataset_config_name=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: dataset_processing_num_proc_per_process=64, [default0]:07/03/2024 09:42:36 
[INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: dataset_overwrite_cache=False, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: text_column_name='text'), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: seed=42, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_loading_workers=0))], [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-32_pp-2_mbz-128')), [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: lighteval=None) [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Model Config: [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: LlamaConfig(bos_token_id=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: eos_token_id=2, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_act='silu', [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: hidden_size=2048, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: initializer_range=0.02, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: intermediate_size=4096, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: is_llama_config=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: max_position_embeddings=4096, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_attention_heads=32, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_hidden_layers=24, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: num_key_value_heads=32, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pad_token_id=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: pretraining_tp=1, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: 
rms_norm_eps=1e-05, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_scaling=None, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: rope_theta=10000.0, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: tie_word_embeddings=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: use_cache=True, [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: vocab_size=50272) [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Building model.. [default0]:07/03/2024 09:42:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Setting PP block ranks... [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=11|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=11|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=11|ip-26-0-171-56]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=13|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=13|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=13|ip-26-0-171-56]: No checkpoint path provided. [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=15|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=8|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=15|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=15|ip-26-0-171-56]: No checkpoint path provided. 
[default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=10|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=10|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=8|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=8|ip-26-0-171-56]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=14|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=10|ip-26-0-171-56]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=14|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=14|ip-26-0-171-56]: No checkpoint path provided. [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=12|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=12|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=12|ip-26-0-171-56]: No checkpoint path provided. [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=8|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=8|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=8|ip-26-0-169-207]: No checkpoint path provided. 
[default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=15|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=15|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=15|ip-26-0-169-207]: No checkpoint path provided. [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=2|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=2|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=2|ip-26-0-170-31]: No checkpoint path provided. [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=9|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=9|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=4|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=4|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=4|ip-26-0-169-139]: No checkpoint path provided. [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=3|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=3|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=3|ip-26-0-170-31]: No checkpoint path provided. [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=9|ip-26-0-169-207]: No checkpoint path provided. 
[default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=12|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=12|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=12|ip-26-0-169-207]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=14|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=14|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=14|ip-26-0-169-207]: No checkpoint path provided. [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=10|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=9|ip-26-0-171-56]: Local number of parameters: 16.4M (31.22MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=10|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=11|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=0|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=0|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=9|ip-26-0-171-56]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=9|ip-26-0-171-56]: No checkpoint path provided. [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=11|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. 
Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=10|ip-26-0-169-207]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=13|ip-26-0-169-207]: Local number of parameters: 21.6M (41.25MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=13|ip-26-0-169-207]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=11|ip-26-0-169-207]: No checkpoint path provided. [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=0|ip-26-0-170-31]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=13|ip-26-0-169-207]: No checkpoint path provided. [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=3|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=3|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=3|ip-26-0-169-139]: No checkpoint path provided. [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=1|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=1|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=1|ip-26-0-170-31]: No checkpoint path provided. [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Total number of parameters: 1.22G (2318.88MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: No checkpoint path provided. 
[default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Parametrizing model parameters using StandardParametrizator [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=26|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=26|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=26|ip-26-0-171-88]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=5|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=5|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=5|ip-26-0-169-139]: No checkpoint path provided. [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=1|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=1|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=6|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=6|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=2|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=2|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=6|ip-26-0-169-139]: No checkpoint path provided. [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=2|ip-26-0-169-139]: No checkpoint path provided. 
[default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=1|ip-26-0-169-139]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=6|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=6|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=6|ip-26-0-170-31]: No checkpoint path provided. [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=7|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=7|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=7|ip-26-0-170-31]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=5|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=5|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=5|ip-26-0-170-31]: No checkpoint path provided. [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=4|ip-26-0-170-31]: Local number of parameters: 16.4M (31.22MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=4|ip-26-0-170-31]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=4|ip-26-0-170-31]: No checkpoint path provided. [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=23|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=23|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=23|ip-26-0-169-239]: No checkpoint path provided. 
[default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=24|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=24|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=24|ip-26-0-171-88]: No checkpoint path provided. [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=25|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=25|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=25|ip-26-0-169-247]: No checkpoint path provided. [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=24|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=24|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=24|ip-26-0-169-247]: No checkpoint path provided. [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=27|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=27|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=27|ip-26-0-169-247]: No checkpoint path provided. [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=18|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=18|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=18|ip-26-0-171-62]: No checkpoint path provided. 
[default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=19|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=19|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=19|ip-26-0-171-62]: No checkpoint path provided. [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=31|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=21|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=21|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=23|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=31|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=31|ip-26-0-169-247]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=21|ip-26-0-169-239]: No checkpoint path provided. [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=23|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=23|ip-26-0-171-62]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=21|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=22|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=21|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. 
Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=21|ip-26-0-171-62]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=22|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=22|ip-26-0-169-239]: No checkpoint path provided. [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=17|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=17|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=17|ip-26-0-171-62]: No checkpoint path provided. [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=20|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=20|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=16|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=16|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=16|ip-26-0-169-239]: No checkpoint path provided. [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=20|ip-26-0-169-239]: No checkpoint path provided. 
[default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=18|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=26|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=17|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=17|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=17|ip-26-0-169-239]: No checkpoint path provided. [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=18|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=26|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=18|ip-26-0-169-239]: No checkpoint path provided. [default2]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=26|ip-26-0-169-247]: No checkpoint path provided. [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=28|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=28|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=28|ip-26-0-169-247]: No checkpoint path provided. [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=19|ip-26-0-169-239]: Local number of parameters: 21.6M (41.25MiB) [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=19|ip-26-0-169-239]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=19|ip-26-0-169-239]: No checkpoint path provided. 
[default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=20|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=20|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=25|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=25|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=20|ip-26-0-171-62]: No checkpoint path provided. [default1]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=25|ip-26-0-171-88]: No checkpoint path provided. [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=28|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=28|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default4]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=28|ip-26-0-171-88]: No checkpoint path provided. [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=27|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=27|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default3]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=27|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=7|ip-26-0-169-139]: Local number of parameters: 21.6M (41.25MiB) [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=7|ip-26-0-169-139]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=7|ip-26-0-169-139]: No checkpoint path provided. 
[default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=29|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=29|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=29|ip-26-0-169-247]: No checkpoint path provided. [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=29|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=29|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default5]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=29|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=31|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=31|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default7]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=31|ip-26-0-171-88]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=30|ip-26-0-171-88]: Local number of parameters: 16.4M (31.22MiB) [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=30|ip-26-0-171-88]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=30|ip-26-0-171-88]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=22|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=22|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=22|ip-26-0-171-62]: No checkpoint path provided. 
[default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=16|ip-26-0-171-62]: Local number of parameters: 16.4M (31.22MiB) [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=16|ip-26-0-171-62]: [After model building] Memory usage: 41.23MiB. Peak allocated: 43.26MiB Peak reserved: 58.00MiB [default0]:07/03/2024 09:42:55 [INFO|DP=0|PP=1|TP=16|ip-26-0-171-62]: No checkpoint path provided. [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=30|ip-26-0-169-247]: Local number of parameters: 21.6M (41.25MiB) [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=30|ip-26-0-169-247]: [After model building] Memory usage: 55.26MiB. Peak allocated: 57.29MiB Peak reserved: 72.00MiB [default6]:07/03/2024 09:42:55 [INFO|DP=0|PP=0|TP=30|ip-26-0-169-247]: No checkpoint path provided. [default0]:07/03/2024 09:42:56 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/03/2024 09:42:56 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/03/2024 09:42:56 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [ZeRO sharding] DP Rank 0 has 21.6M out of 21.6M (100.00%) params' optimizer states [default0]:07/03/2024 09:42:58 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/03/2024 09:42:58 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Using `datasets` library [default0]:07/03/2024 09:42:58 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/03/2024 09:42:58 [WARNING|DP=0|PP=0|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. 
[default0]:07/03/2024 09:42:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Training Plan] There are 1 training stages [default0]:07/03/2024 09:42:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Stage Training Stage] start from step 1 [default0]:07/03/2024 09:42:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [default0]:07/03/2024 09:42:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: [Start training] datetime: 2024-07-03 09:42:59.717786 | mbs: 128 | grad_accum: 8 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/03/2024 09:42:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/03/2024 09:42:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-169-139]: Memory usage: 220.25MiB. Peak allocated 220.25MiB. Peak reserved: 240.00MiB [default6]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=6|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=30|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=11|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=13|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. 
[default6]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=14|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=15|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=10|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=8|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=12|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=4|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=12|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=10|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=13|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=3|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=2|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. 
[default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=9|ip-26-0-171-56]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=11|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=14|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=1|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=0|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=3|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=26|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=7|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=6|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=1|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. 
Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=5|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=4|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=5|ip-26-0-170-31]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=2|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=23|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=24|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=27|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=22|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=31|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. 
[default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=20|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=25|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=21|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=16|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=24|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=18|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=17|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=26|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=18|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=28|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=19|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. 
Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=23|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=17|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=20|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=28|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=27|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=25|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=19|ip-26-0-169-239]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=29|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=7|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. 
[default0]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=31|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=29|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=22|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=16|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=8|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. 
Setting CardData to empty. [default7]:07/03/2024 09:43:00 [WARNING|DP=0|PP=0|TP=15|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 09:42:59 [WARNING|DP=0|PP=0|TP=9|ip-26-0-169-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 09:42:59 [WARNING|DP=0|PP=1|TP=21|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 09:43:05 [WARNING|DP=0|PP=1|TP=30|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank8]: output = model(**micro_batch) [default0]:[rank8]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/na[default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl 
[default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] notron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank2]: output = model(**micro_batch) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: sharded_logits = self.model( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/l[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank8]: output = self.pp_block(**new_kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl ib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return fo[default0]:[rank8]: return self._call_impl(*args, **kwargs) rward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank2]: output = self.o_proj(attention_output) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank2]: return row_linear( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank2]: out = F.linear(input, weight, bias) [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.47 GiB is free. Including non-PyTorch memory, this process has 77.85 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank8]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default0]:[rank8]: output = self.o_proj(attention_output) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank11]: trainer.train(dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: return row_linear( [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: out = F.linear(input, weight, bias) [default0]:[rank8]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. 
GPU [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank11]: output = model(**micro_batch) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank11]: sharded_logits = self.model( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, 
in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank10]: output = model(**micro_batch) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: sharded_logits = self.model( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return 
forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank10]: output = self.pp_block(**new_kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in 
forward [default2]:[rank10]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank10]: output = self.o_proj(attention_output) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return row_linear( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank10]: out = F.linear(input, weight, bias) [default3]:[rank11]: return forward_call(*args, **kwargs) [default2]:[rank10]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. 
GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.84 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank11]: output = self.pp_block(**new_kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default3]:[rank11]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank11]: output 
= self.o_proj(attention_output) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank11]: return row_linear( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default3]:[rank11]: out = F.linear(input, weight, bias) [default3]:[rank11]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 601.94 MiB is free. Including non-PyTorch memory, this process has 78.73 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in 
_wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank5]: output = self.pp_block(**new_kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in 
forward [default5[default7]:[rank15]: Traceback (most recent call last): [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: trainer.train(dataloader) [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: sharded_logits = self.model( [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: output = self.pp_block(**new_kwargs) [default7]:[rank15]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl 
[default1]:[rank9]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank9]: output = self.o_proj(attention_output) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in 
forward [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: output = self.pp_block(**new_kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default7]:[rank15]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return row_linear( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank9]: out = F.linear(input, weight, bias) [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 601.94 MiB is free. 
Including non-PyTorch memory, this process has 78.73 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default7]:[rank15]: output = self.o_proj(attention_output) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default7]:[rank15]: return row_linear( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default7]:[rank15]: out = F.linear(input, weight, bias) [default7]:[rank15]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 601.94 MiB is free. Including non-PyTorch memory, this process has 78.73 GiB memory in use. 
Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: Traceback (most recent call last): [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: trainer.train(dataloader) [default5]:[rank13]: Traceback (most recent call last): [default6]:[rank14]: trainer.train(dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank14]: output = model(**micro_batch) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank12]: output = model(**micro_batch) [default4]:[rank12]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: output = model(**micro_batch) [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: return forward_call(*args, **kwargs) [default4]:[rank12]: sharded_logits = self.model( [default5]:[rank13]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: sharded_logits = self.model( [default5]:[rank13]: sharded_logits = self.model( [default4]:[rank12]: return self._call_impl(*args, **kwargs) 
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: return 
self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default2]:[rank26]: Traceback (most recent call last): [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: trainer.train(dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank26]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank26]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank26]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank12]: output = self.pp_block(**new_kwargs) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank26]: output = model(**micro_batch) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, 
**kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank26]: sharded_logits = self.model( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [def[default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) ault2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank26]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank26]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank26]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank26]: output = self.pp_block(**new_kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank26]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn[default4]:[rank12]: return self._call_impl(*args, **kwargs) /modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank26]: output = self.o_proj(attention_output) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank2[default5]:[rank13]: return self._call_impl(*args, **kwargs) 6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank26]: return row_linear( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank26]: out = F.linear(input, weight, bias) [default2]:[rank26]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 673.94 MiB is free. Including non-PyTorch memory, this process has 78.66 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: Traceback (most recent call last): [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank24]: trainer.train(dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank24]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank24]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank24]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank24]: output = model(**micro_batch) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl 
[default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank24]: sharded_logits = self.model( [default5]:[rank13]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank24]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank24]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank24]: output = self.pp_block(**new_kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward 24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank24]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default0]:[rank24]: output = self.o_proj(attention_output) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_clust[default5]:[rank13]: output = self.pp_block(**new_kwargs) er/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default0]:[rank24]: return row_linear( [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default0]:[rank24]: out = F.linear(input, weight, bias) [default0]:[rank24]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. 
GPU [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: Traceback (most recent call last): [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: trainer.train(dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: return forward_call(*args, **kwargs) [default4]:[rank28]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank28]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank28]: output = model(**micro_batch) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank28]: sharded_logits = self.model( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank28]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank28]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl 
[default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank28]: output = self.pp_block(**new_kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default5]:[rank13]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default4]:[rank28]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: output = self.o_proj(attention_output) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: output = self.pp_block(**new_kwargs) 
[default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank28]: output = self.o_proj(attention_output) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank28]: return row_linear( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank28]: out = F.linear(input, weight, bias) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: torch.cuda.OutOfMemoryError: CUDA 
out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 673.94 MiB is free. Including non-PyTorch memory, this process has 78.66 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default5]:[rank13]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank12]: return row_linear( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank14]: return forward_call(*args, **kwargs) [default5]:[rank13]: output = self.o_proj(attention_output) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default4]:[rank12]: out = F.linear(input, weight, bias) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in 
_wrapped_call_impl [default6]:[rank14]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.84 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank13]: return row_linear( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default5]:[rank13]: out = F.linear(input, weight, bias) ]:[rank5]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default5]:[rank5]: output = self.o_proj(attention_output) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinan[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl dmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank5]: return row_linear( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default5]:[rank13]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 601.94 MiB is free. Including non-PyTorch memory, this process has 78.73 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank5]: out = F.linear(input, weight, bias) [default5]:[rank5]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.36 GiB is free. 
Including non-PyTorch memory, this process has 77.96 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank14]: output = self.o_proj(attention_output) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank14]: return row_linear( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank14]: out = F.linear(input, weight, bias) [default6]:[rank14]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.84 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank22]: Traceback (most recent call last): [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank22]: trainer.train(dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank22]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default6]:[rank22]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank22]: output = model(**micro_batch) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank22]: sharded_logits = self.model( [default6]:[rank22]: File "/fsx/ferdinandmom/[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) 
[default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank22]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank20]: Traceback (most recent call last): [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank22]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: output = self.o_proj(attention_output) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: trainer.train(dataloader) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank20]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank20]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: return row_linear( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default0]:[rank0]: out = F.linear(input, weight, bias) [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: 
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU [default6]:[rank22]: output = self.pp_block(**new_kwargs) [default6]:[rank6]: Traceback (most recent call last): [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default6]:[rank22]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in 
_call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank22]: output = self.o_proj(attention_output) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank22]: return 
row_linear( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default4]:[rank20]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank22]: out = F.linear(input, weight, bias) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", 
line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: output = self.pp_block(**new_kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 601.94 MiB is free. Including non-PyTorch memory, this process has 78.73 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank20]: sharded_logits = self.model( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank20]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: hidden_encoder_states = 
encoder_block(**hidden_encoder_states) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank20]: output = self.pp_block(**new_kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank6]: output = self.o_proj(attention_output) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default4]:[rank20]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default6]:[rank6]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank20]: output = self.o_proj(attention_output) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank6]: return row_linear( [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank6]: out = F.linear(input, weight, bias) [default4]:[rank20]: return row_linear( [default4]:[rank20]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank6]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.47 GiB is free. Including non-PyTorch memory, this process has 77.85 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default4]:[rank20]: out = F.linear(input, weight, bias) [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/na[default4]:[rank20]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 601.94 MiB is free. 
Including non-PyTorch memory, this process has 78.73 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) notron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: Traceback (most recent call last): [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nano[default0]:[rank16]: Traceback (most recent call last): [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in tron/models/llama.py", line 764, in forward [default1]:[rank1]: return 
self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1[default7]:[rank23]: Traceback (most recent call last): [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank17]: trainer.train(dataloader) [default0]:[rank16]: trainer.train(dataloader) ]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank1]: output = self.attn(hidden_states=hidden_states, 
sequence_mask=sequence_mask) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/minif[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train orge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: trainer.train(dataloader) [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank1]: output = self.o_proj(attention_output) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank1]: return row_linear( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_para[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train llel/functional.py", line 474, in row_linear [default0]:[rank16]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: out = F.linear(input, weight, bias) [default0]:[rank16]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.36 GiB is free. Including non-PyTorch memory, this process has 77.96 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank17]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank16]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/na[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in 
train_batch_iter [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train notron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nano[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step tron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank4]: output = self.pp_block(**new_kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in [default1]:[rank17]: outputs = self.pipeline_engine.train_batch_iter( forward [default4]:[rank4]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank4]: output = self.o_proj(attention_output) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: Fi[default7]:[rank23]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step le "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: return row_linear( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank4]: out = F.linear(input, weight, bias) [default4]:[rank4]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.47 GiB is free. Including non-PyTorch memory, this process has 77.85 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank23]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: output = model(**micro_batch) [default1]:[rank17]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: output = model(**micro_batch) [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank16]: return forward_call(*args, **kwargs) [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank23]: output = 
self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank3]: output = self.pp_block(**new_kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank16]: sharded_logits = self.model( [default3]:[rank3]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank3]: output = self.o_proj(attention_output) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: output = model(**micro_batch) [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank3]: return row_linear( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: out = F.linear(input, weight, bias) [default1]:[rank17]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.36 GiB is free. Including non-PyTorch memory, this process has 77.96 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank17]: return forward_call(*args, **kwargs) [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, 
[default0]:[rank16]: return forward_call(*args, **kwargs) in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: sharded_logits = self.model( [default0]:[rank16]: return 
self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank7]: output = self.pp_block(**new_kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default7]:[rank7]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default0]:[rank16]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default7]:[rank7]: output = self.o_proj(attention_output) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: return row_linear( [default7]:[rank7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: out = F.linear(input, weight, bias) [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank7]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.36 GiB is free. Including non-PyTorch memory, this process has 77.96 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank17]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank16]: return forward_call(*args, **kwargs) [default7]:[rank23]: sharded_logits = self.model( [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: output = self.pp_block(**new_kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in 
forward_with_hidden_states [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank17]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank23]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank17]: output = self.pp_block(**new_kwargs) [default7]:[rank23]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank23]: output = self.pp_block(**new_kwargs) [default1]:[rank25]: Traceback (most recent call last): [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank25]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank25]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank27]: Traceback (most recent call last): [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank27]: trainer.train(dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default3]:[rank27]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank27]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank27]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank27]: output = model(**micro_batch) [default3]:[rank27]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank25]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank23]: return forward_call(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank25]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank16]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank23]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank25]: output = model(**micro_batch) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward 
[default0]:[rank16]: output = self.o_proj(attention_output) [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: return forward_call(*args, **kwargs) [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank25]: sharded_logits = self.model( [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: sharded_logits = 
self.model( [default7]:[rank23]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default0]:[rank16]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default7]:[rank23]: output = self.o_proj(attention_output) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: return forward_call(*args, **kwargs) [default1]:[rank17]: output = self.o_proj(attention_output) [default3]:[rank27]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank25]: return self.forward_with_hidden_states(input_ids=input_ids, 
input_mask=input_mask)[0] [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank16]: return row_linear( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank25]: return forward_call(*args, **kwargs) [default1]:[rank17]: return forward_call(*args, **kwargs) [default0]:[rank16]: out = F.linear(input, weight, bias) [default3]:[rank27]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank27]: output = self.pp_block(**new_kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return row_linear( [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank25]: output = self.pp_block(**new_kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl 
[default7]:[rank23]: out = F.linear(input, weight, bias) [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: return row_linear( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.84 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank25]: return forward_call(*args, **kwargs) [default1]:[rank17]: out = F.linear(input, weight, bias) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank25]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank17]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.84 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]:[rank30]: Traceback (most recent call last): [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank30]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank19]: Traceback (most recent call last): [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank19]: trainer.train(dataloader) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank19]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank19]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank19]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank19]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank19]: output = model(**micro_batch) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank19]: sharded_logits = self.model( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) 
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank25]: output = self.o_proj(attention_output) [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank19]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank19]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank30]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank19]: output = self.pp_block(**new_kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank19]: output = self.o_proj(attention_output) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default1]:[rank25]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank30]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank19]: return row_linear( 
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default3]:[rank19]: out = F.linear(input, weight, bias) [default3]:[rank19]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.84 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: output = self.o_proj(attention_output) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank30]: output = model(**micro_batch) [default1]:[rank25]: return forward_call(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", 
line 1541, in _call_impl [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default1]:[rank25]: return row_linear( [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank30]: sharded_logits = self.model( [default7]:[rank31]: Traceback (most recent call last): [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank27]: return row_linear( [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: Traceback (most recent call last): [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank18]: trainer.train(dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank18]: outputs = 
self.pipeline_engine.train_batch_iter( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank18]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank29]: Traceback (most recent call last): [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank18]: output = model(**micro_batch) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: out = F.linear(input, weight, bias) [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank18]: sharded_logits = self.model( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank30]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank18]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank18]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank18]: output = self.pp_block(**new_kwargs) [default1]:[rank25]: out = F.linear(input, weight, bias) [default3]:[rank27]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 861.94 MiB is free. Including non-PyTorch memory, this process has 78.48 GiB memory in use. 
Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank31]: trainer.train(dataloader) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank18]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank18]: output = self.o_proj(attention_output) [default2]:[rank18]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: trainer.train(dataloader) [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank18]: return row_linear( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank18]: out = F.linear(input, weight, bias) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 601.94 MiB is free. Including non-PyTorch memory, this process has 78.73 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank30]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank25]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 861.94 MiB is free. 
Including non-PyTorch memory, this process has 78.48 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank31]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank31]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank30]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward 
[default7]:[rank31]: output = model(**micro_batch) [default6]:[rank30]: output = self.pp_block(**new_kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank29]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank29]: output = model(**micro_batch) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: return forward_call(*args, **kwargs) [default6]:[rank30]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank29]: return forward_call(*args, **kwargs) [default6]:[rank30]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default7]:[rank31]: sharded_logits = self.model( [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: sharded_logits = self.model( [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default5]:[rank29]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank30]: output = self.o_proj(attention_output) [default7]:[rank31]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank31]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank30]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl 
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: return row_linear( [default5]:[rank29]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank30]: out = F.linear(input, weight, bias) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: output = self.pp_block(**new_kwargs) [default7]:[rank31]: return forward_call(*args, **kwargs) [default6]:[rank30]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 673.94 MiB is free. Including non-PyTorch memory, this process has 78.66 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank21]: Traceback (most recent call last): [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank21]: trainer.train(dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank21]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank21]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank21]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default7]:[rank31]: output = self.pp_block(**new_kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default5]:[rank29]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank21]: output = model(**micro_batch) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank21]: sharded_logits = self.model( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward 
[default5]:[rank21]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank21]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: output = self.pp_block(**new_kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default7]:[rank31]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default5]:[rank21]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default5]:[rank21]: output = self.o_proj(attention_output) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank21]: return row_linear( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default5]:[rank21]: out = F.linear(input, weight, bias) [default5]:[rank21]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. 
GPU  has a total capacity of 79.33 GiB of which 493.94 MiB is free. Including non-PyTorch memory, this process has 78.84 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default7]:[rank31]: return forward_call(*args, **kwargs) [default5]:[rank29]: output = self.o_proj(attention_output) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: output = self.o_proj(attention_output) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in 
_call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank29]: return row_linear( [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default5]:[rank29]: out = F.linear(input, weight, bias) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default7]:[rank31]: return row_linear( [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default7]:[rank31]: out = F.linear(input, weight, bias) [default7]:[rank31]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 861.94 MiB is free. Including non-PyTorch memory, this process has 78.48 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank29]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 861.94 MiB is free. Including non-PyTorch memory, this process has 78.48 GiB memory in use. Of the allocated memory 68.85 GiB is allocated by PyTorch, and 78.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank42]: Traceback (most recent call last): [default5]:[rank45]: Traceback (most recent call last): [default1]:[rank41]: Traceback (most recent call last): [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: trainer.train(dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: trainer.train(dataloader) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank42]: trainer.train(dataloader) [default7]:[rank47]: Traceback (most recent call last): [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank45]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank41]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank47]: trainer.train(dataloader) [default1]:[rank41]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank45]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank41]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank45]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank42]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank42]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank47]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank42]: output = model(**micro_batch) [default7]:[rank47]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank45]: output = model(**micro_batch) [default1]:[rank41]: output = model(**micro_batch) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: return forward_call(*args, **kwargs) [default7]:[rank47]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in 
_call_impl [default2]:[rank42]: return forward_call(*args, **kwargs) [default7]:[rank47]: output = model(**micro_batch) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank45]: sharded_logits = self.model( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: sharded_logits = self.model( [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank42]: sharded_logits = self.model( [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: return forward_call(*args, **kwargs) [default7]:[rank47]: return forward_call(*args, **kwargs) [default1]:[rank41]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank41]: return forward_call(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank45]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank45]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank42]: return forward_call(*args, **kwargs) [default7]:[rank47]: sharded_logits = self.model( [default1]:[rank41]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank41]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank45]: return forward_call(*args, **kwargs) [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: Traceback (most recent call last): [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank40]: trainer.train(dataloader) [default2]:[rank42]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank47]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: new_kwargs[name] = recv_from_pipeline_state_buffer( 
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank42]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: pipeline_state.run_communication() [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank41]: return forward_call(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank41]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank45]: 
recv_activation_tensor = recv_activation() [default0]:[rank40]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank42]: return forward_call(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank41]: pipeline_state.run_communication() [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank47]: return forward_call(*args, **kwargs) [default0]:[rank40]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank47]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank42]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank41]: recv_activation_tensor = recv_activation() [default7]:[rank47]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank47]: pipeline_state.run_communication() [default0]:[rank40]: output = model(**micro_batch) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank42]: pipeline_state.run_communication() [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank45]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank42]: recv_activation_tensor = recv_activation() [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank41]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank42]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank47]: recv_activation_tensor = recv_activation() [default7]:[rank47]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank42]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank41]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank47]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank45]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank42]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank41]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank47]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank45]: dist.recv( [default0]:[rank40]: return forward_call(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank42]: dist.recv( [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank40]: sharded_logits = self.model( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank45]: return func(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank45]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank42]: return func(*args, **kwargs) [default7]:[rank47]: meta = 
self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank40]: return forward_call(*args, **kwargs) [default5]:[rank45]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank45]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank41]: dist.recv( [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank40]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank42]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank45]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0477600897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank42]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank45]: frame #1: + 0x5b3a23e 
(0x7f04b111d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: return func(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank42]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank47]: dist.recv( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank45]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f04b1117c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f04b1117f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank42]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8fe8fe7897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank42]: frame #1: + 0x5b3a23e (0x7f9022b0423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f9022afec87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: return forward_call(*args, **kwargs) [default2]:[rank42]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9022afef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: return func(*args, **kwargs) [default1]:[rank41]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank45]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f04b1118fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f04b10cd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9022afffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank45]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f04b10cd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank41]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): 
[default2]:[rank42]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9022ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9022ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank42]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9022ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank41]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a820b2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank42]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9022ab4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank45]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f04b10cd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f04b10cd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #9: 
c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8fea2c1189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank42]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8fea2c8610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank47]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb7aaa33897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank33]: Traceback (most recent call last): [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank33]: trainer.train(dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank33]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank33]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank33]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default2]:[rank42]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f8fea2e7978 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: pipeline_state.run_communication() ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank33]: output = model(**micro_batch) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank33]: sharded_logits = self.model( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [def[default5]:[rank45]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f04788da189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f04788e1610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) ault1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank40]: recv_activation_tensor = recv_activation() [default1]:[rank33]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank33]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank33]: new_kwargs[name] = recv_from[default5]:[rank45]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0478900978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) _pipeline_state_buffer( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank33]: pipeline_state.run_communication() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank33]: 
recv_activation_tensor = recv_activation() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank33]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank33]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag[default5]:[rank45]: frame #12: + 0x5adc309 (0x7f04b10bf309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #12: + 0x5adc309 (0x7f9022aa6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) =tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank33]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank33]: dist.recv( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank33]: return func(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank33]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank33]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] [default7]:[rank47]: frame #1: + 0x5b3a23e (0x7fb7e455023e in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank33]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank33]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1d11395897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank47]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fb7e454ac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #1: + 0x5b3a23e (0x7f1d4aeb223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1d4aeacc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1d4aeacf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1d4aeadfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1d4ae62371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1d4ae62371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default1]:[rank33]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1d4ae62371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1d4ae62371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1d1266f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f1d12676610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f1d12695978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #12: + 0x5adc309 (0x7f1d4ae54309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #13: + 0x5ae6f10 (0x7f1d4ae5ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #14: + 0x5ae6fa5 (0x7f1d4ae5efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #15: + 0x5124446 (0x7f1d4a49c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #16: + 0x1acf4b8 (0x7f1d46e474b8 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #17: + 0x5aee004 (0x7f1d4ae66004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #18: + 0x5af36b5 (0x7f1d4ae6b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #19: + 0xd2631e (0x7f1d5da5531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank33]: frame #20: + 0x47def4 (0x7f1d5d1acef4 in /fsx/fe[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ rdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank33]: frame #21: + 0x1445a6 (0x55b9cf0545a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b9cf04da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #23: + 0x150866 (0x55b9cf060866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b9cf049142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b9cf054a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #26: PyObject_Call + 0xbc (0x55b9cf060f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #27: [default5]:[rank45]: frame #13: + 0x5ae6f10 (0x7f04b10c9f10 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) _PyEval_EvalFrameDefault + 0x2d83 (0x55b9cf0472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b9cf054a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b9cf0458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #30: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b9cf0458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #32: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b9cf0458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.1[default1]:[rank41]: frame #1: + 0x5b3a23e (0x7f4abbbcf23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 0) [default1]:[rank33]: frame #34: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b9cf0458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b9cf04cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b9cf05ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #38: + 0x211239 (0x55b9cf121239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default1]:[rank33]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b9cf04da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b9cf0493e6 in /fsx/ferdinandmom/miniforge3/envs[default1]:[rank41]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f4abbbc9c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) /env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #13: + 0x5ae6f10 (0x7f9022ab0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b9cf054a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b9cf044c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b9cf054a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b9cf0458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #45: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #46: PyObject_Call + 0xbc (0x55b9cf060f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9cf0472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-clu[default5]:[rank45]: frame #14: + 0x5ae6fa5 (0x7f04b10c9fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ster/bin/python3.10) [default1]:[rank41]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4abbbc9f82 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #48: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #49: PyObject_Call + 0xbc (0x55b9cf060f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9cf0472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b9cf054a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b9cf04d007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4abbbcafd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b9cf05ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #54: + 0x211239 (0x55b9cf121239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #55: PyObject_Call + 0x207 (0x55b9cf061067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4abbb7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9cf0472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #57: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #58: 
_PyEval_EvalFrameDefault + 0x13ca (0x55b9cf0458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4abbb7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #59: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #60: PyObject_Call + 0xbc (0x55b9cf060f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank33]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b9cf0472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #62: + 0x150582 (0x55b9cf060582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #63: PyObject_Call + 0xbc (0x55b9cf060f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:[rank41]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4abbb7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4abbb7f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4a8338c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: Traceback (most recent call last): [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank32]: trainer.train(dataloader) [default1]:[rank41]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4a83393610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank32]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: frame #14: + 0x5ae6fa5 (0x7f9022ab0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb7e454af82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank32]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank32]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank32]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank32]: output = model(**micro_batch) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank32]: return self._call_impl(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/py[default5]:[rank45]: frame #15: + 0x5124446 (0x7f04b0707446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) thon3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank32]: return forward_call(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank32]: sharded_logits = self.model( [default0]:[rank40]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank32]: return self._call_impl(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank32]: return forward_call(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank32]: 
File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank32]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank43]: Traceback (most recent call last): [default0]:[rank40]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank32]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank32]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f4a833b2978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank41]: frame #12: + 0x5adc309 (0x7f4abbb71309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank32]: return forward_call(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank41]: frame #13: + 0x5ae6f10 (0x7f4abbb7bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in 
recv_from_pipeline_state_buffer [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: dist.recv( [default0]:[rank32]: pipeline_state.run_communication() [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank45]: frame #16: + 0x1acf4b8 (0x7f04ad0b24b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #17: + 0x5aee004 (0x7f04b10d1004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: recv_activation_tensor = recv_activation() [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank32]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank32]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank47]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb7e454bfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank32]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank47]: frame #5: 
c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb7e4500371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: dist.recv( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank44]: Traceback (most recent call last): [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: frame #14: + 0x5ae6fa5 (0x7f4abbb7bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #15: + 0x5124446 (0x7f4abb1b9446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: return func(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank32]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank46]: Traceback (most recent call last): [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank45]: frame #18: + 0x5af36b5 (0x7f04b10d66b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank32]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank45]: frame #19: + 
0xd2631e (0x7f04c3cc031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank45]: frame #20: + 0x47def4 (0x7f04c3417ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank32]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f144ae66897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank32]: frame #1: + 0x5b3a23e (0x7f148498323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank32]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f148497dc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f148497df82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: trainer.train(dataloader) [default0]:[rank32]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f148497efd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1484933371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1484933371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #6: 
c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb7e4500371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1484933371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1484933371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f144c140189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank42]: frame #15: + 0x5124446 (0x7f90220ee446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #16: + 0x1acf4b8 (0x7f901ea994b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f144c147610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f144c166978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank32]: frame #12: + 0x5adc309 (0x7f1484925309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #13: + 0x5ae6f10 (0x7f148492ff10 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #16: + 0x1acf4b8 (0x7f4ab7b644b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #14: + 0x5ae6fa5 (0x7f148492ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #15: + 0x5124446 (0x7f1483f6d446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #16: + 0x1acf4b8 (0x7f14809184b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #17: + 0x5aee004 (0x7f1484937004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: trainer.train(dataloader) [default3]:[rank43]: trainer.train(dataloader) [default0]:[rank40]: return func(*args, **kwargs) [default0]:[rank32]: frame #18: + 0x5af36b5 (0x7f148493c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #19: + 0xd2631e (0x7f149752631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank32]: frame #20: + 0x47def4 (0x7f1496c7def4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank45]: frame #21: + 0x1445a6 (0x5578e7e965a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #21: + 0x1445a6 (0x55699b8525a6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55699b84ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank32]: frame #23: + 0x150866 (0x55699b85e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55699b847142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55699b852a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb7e4500371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #26: PyObject_Call + 0xbc (0x55699b85ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55699b8452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55699b852a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55699b8438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5578e7e8fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #23: + 0x150866 (0x5578e7ea2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:[rank32]: frame #30: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55699b8438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #32: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55699b8438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #34: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55699b8438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55699b84af50 in /fsx/ferdinandmom/miniforge3/en[default5]:[rank45]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5578e7e8b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) vs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank32]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55699b85cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #38: + 0x211239 (0x55699b91f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55699b84ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55699b8473e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #17: + 0x5aee004 (0x7f4abbb83004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: 
frame #41: _PyFunction_Vectorcall + 0x6c (0x55699b852a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55699b842c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55699b852a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #18: + 0x5af36b5 (0x7f4abbb886b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #19: + 0xd2631e (0x7f4ace77231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank32]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55699b8438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #45: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank32]: frame #46: PyObject_Call + 0xbc (0x55699b85ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55699b8452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #48: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #17: + 0x5aee004 (0x7f9022ab8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #18: + 0x5af36b5 (0x7f9022abd6b5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #49: PyObject_Call + 0xbc (0x55699b85ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55699b8452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55699b852a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55699b84b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55699b85cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #54: + 0x211239 (0x55699b91f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #20: + 0x47def4 (0x7f4acdec9ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank42]: frame #19: + 0xd2631e (0x7f90356a731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank32]: frame #55: PyObject_Call + 0x207 (0x55699b85f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55699b8452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #57: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb7e4500371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank32]: frame #58: 
_PyEval_EvalFrameDefault + 0x13ca (0x55699b8438fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #59: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5578e7e96a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #60: PyObject_Call + 0xbc (0x55699b85ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55699b8452b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #62: + 0x150582 (0x55699b85e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank32]: frame #63: PyObject_Call + 0xbc (0x55699b85ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #26: PyObject_Call + 0xbc (0x5578e7ea2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank32]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:[rank45]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5578e7e892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: Traceback (most recent call last): [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank35]: trainer.train(dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank35]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank35]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank35]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default5]:[rank45]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5578e7e96a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #20: + 0x47def4 (0x7f9034dfeef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank44]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank35]: output = model(**micro_batch) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl 
[default3]:[rank35]: return self._call_impl(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank35]: return forward_call(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank35]: sharded_logits = self.model( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank35]: return self._call_impl(*args, **kwargs) [def[default0]:[rank40]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): ault3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank35]: return forward_call(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank35]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank35]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank35]: return self._call_impl(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-clus[default6]:[rank46]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) ter/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in 
_call_impl [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank35]: return forward_call(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank35]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank35]: pipeline_state.run_communication() [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank35]: recv_activation_tensor = recv_activation() [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank35]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank47]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb7abd0d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank35]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank35]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank35]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank35]: dist.recv( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank35]: return func(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site[default7]:[rank47]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fb7abd14610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) -packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank35]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank35]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank35]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank35]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f785d2ee897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank35]: frame #1: + 0x5b3a23e (0x7f7896e0b23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration + 0x1445a6 (0x56442ab8a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) > >) + 0x2c7 (0x7f7896e05c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #3: c10d::TCPStore::doGet(std::string 
const&) + 0x32 (0x7f7896e05f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7896e06fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7896dbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7896dbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #7: c10d::PrefixStore::get(st[default6]:[rank46]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank43]: outputs = self.pipeline_engine.train_batch_iter( d::string const&) + 0x31 (0x7f7896dbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7896dbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f785e5c8189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f785e5cf610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: 
frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f785e[default1]:[rank41]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56442ab83a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #21: + 0x1445a6 (0x5626ef77b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5626ef774a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 5ee978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: frame #12: + 0x5adc309 (0x7f7896dad309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #13: + 0x5ae6f10 (0x7f7896db7f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #14: + 0x5ae6fa5 (0x7f7896db7fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #15: + 0x5124446 (0x7f78963f5446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #16: + 0x1acf4b8 (0x7f7892da04b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3[default0]:[rank40]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8479e88897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) .10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #17: + 0x5aee004 (0x7f7896dbf004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #18: + 0x5af36b5 (0x7f7896dc46b5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #19: + 0xd2631e (0x7f78a99ae31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #20: + 0x47def4 (0x7f78a9105ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #21: + 0x1445a6 (0x5620dfe665a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter 20dfe5fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #23: + 0x150866 (0x5620dfe72866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5620dfe5b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5620dfe66a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #26: PyObject_Call + 0xbc (0x5620dfe72f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5620dfe592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5620dfe66a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #29: _PyEval_EvalFrame[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter 
[default6]:[rank46]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank42]: frame #23: + 0x150866 (0x5626ef787866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) Default + 0x13ca (0x5620dfe578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #30: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5620dfe578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #32: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5620dfe578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #34: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5620dfe578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[ra[default5]:[rank45]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5578e7e878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) nk35]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5620dfe5ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5620dfe70c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #38: + 0x211239 (0x5620dff33239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5620dfe5fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 
(0x5620dfe5b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5620dfe66a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5620dfe56c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step /bin/python3.10) [default3]:[rank35]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5620dfe66a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5620dfe578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #45: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #46: PyObject_Call + 0xbc (0x5620dfe72f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5620dfe592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #48: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #49: PyObject_Call + 0xbc (0x5620dfe72f1c in /fsx/ferdinandmom/miniforge3/envs/env-benc[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward h-cluster/bin/python3.10) [default3]:[rank35]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5620dfe592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5620dfe66a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame 
#52: _PyObject_FastCallDictTstate + 0x187 (0x5620dfe5f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5620dfe70c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #54: + 0x211239 (0x5620dff33239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #55: PyObject_Call + 0x207 (0x5620dfe73067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5620dfe592b3 in /fsx/ferdinandmo[default3]:[rank43]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) m/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #57: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5620dfe578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #59: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #60: PyObject_Call + 0xbc (0x5620dfe72f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5620dfe592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #62: + 0x150582 (0x5620dfe72582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #63: PyObject_Call + 0xbc (0x5620dfe72f1c in /fsx/fe[default5]:[rank45]: frame #30: + 0x150582 (0x5578e7ea2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fb7abd33978 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: outputs = self.pipeline_engine.train_batch_iter( rdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank36]: Traceback (most recent call last): [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank36]: trainer.train(dataloader) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank36]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank36]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank36]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default1]:[rank41]: frame #23: + 0x150866 (0x56442ab96866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank36]: output = model(**micro_batch) [default4]:[rank36]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return forward_call(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank36]: sharded_logits = self.model( [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: return self._call_impl(*args, **kwargs) [def[default1]:[rank41]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56442ab7f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #1: + 0x5b3a23e (0x7f84b39a523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ault4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return forward_call(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank36]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank44]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank45]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5578e7e878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: output = model(**micro_batch) [default0]:[rank40]: frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f84b399fc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank36]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return forward_call(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank36]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: output = model(**micro_batch) nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank36]: pipeline_state.run_communication() [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank36]: recv_activation_tensor = recv_activation() [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank36]: return self.p2p.recv_tensors(num_tensors=1, 
from_rank=self.from_rank)[0] [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank36]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/[default1]:[rank41]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56442ab8aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #26: PyObject_Call + 0xbc (0x56442ab96f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56442ab7d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank36]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank36]: dist.recv( [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank36]: return func(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank36]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by 
peer [default4]:[rank36]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank36]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdfba9db897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank36]: frame #1: + 0x5b3a23e (0x7fdff44f823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fdff44f2c87 in /fsx/ferdinandmom/miniforge3/envs/env[default2]:[rank42]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5626ef770142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #12: + 0x5adc309 (0x7fb7e44f2309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) -bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fdff44f2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fdff44f3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdff44a8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdff44a8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdff44a8371 in 
/fsx/ferdinandmom/minifor[default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl ge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdff44a8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fdfbbcb5189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fdfbbcbc610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fdfbbcdb978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/pyt[default3]:[rank43]: output = model(**micro_batch) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl hon3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: frame #32: + 0x150582 (0x5578e7ea2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56442ab8aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #12: + 0x5adc309 (0x7fdff449a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #13: + 0x5ae6f10 (0x7fdff44a4f10 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: frame #14: + 0x5ae6fa5 (0x7fdff44a4fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #15: + 0x5124446 (0x7fdff3ae2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #16: + 0x1acf4b8 (0x7fdff048d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5626ef77ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #13: + 0x5ae6f10 (0x7fb7e44fcf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #17: + 0x5aee004 (0x7fdff44ac004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #18: + 0x5af36b5 (0x7fdff44b16b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #19: + 0xd2631e (0x7fe00709b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank36]: frame #20: + 0x47def4 (0x7fe0067f2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: frame #21: + 0x1445a6 (0x5602b05795a6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5602b0572a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #23: + 0x150866 (0x5602b0585866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5602b056e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5602b0579a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #26: PyObject_Call + 0xbc (0x5602b0585f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5602b056c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster[default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl /bin/python3.10) [default4]:[rank36]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5602b0579a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5602b056a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #30: + 0x150582 (0x5602b0585582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5602b056a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #32: + 0x150582 (0x5602b0585582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5602b056a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame 
#34: + 0x150582 (0x5602b0585582 in /fsx/ferdinandmom/mi[default5]:[rank45]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5578e7e878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) niforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5602b056a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5602b0571f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5602b0583c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #38: + 0x211239 (0x5602b0646239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5602b0572a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5602b056e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5602[default5]:[rank45]: frame #34: + 0x150582 (0x5578e7ea2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) b0579a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5602b0569c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5602b0579a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5602b056a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #45: + 0x150582 (0x5602b0585582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5578e7e878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #46: PyObject_Call + 0xbc (0x5602b0585f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5602b056c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #48: + 0x150582 (0x5602b0585582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #49: PyObject_Call + 0xbc (0x5602b0585f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5602b056c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5602b0579a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5602b0572007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/[default5]:[rank45]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5578e7e8ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) bin/python3.10) [default4]:[rank36]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5602b0583c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #54: + 0x211239 (0x5602b0646239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #55: PyObject_Call + 0x207 (0x5602b0586067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56442ab7b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame 
#56: _PyEval_EvalFrameDefault + 0x2d83 (0x5602b056c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #57: + 0x150582 (0x5602b0585582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5602b056a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #59: + 0x150582 (0x5602b0585582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #60: PyObject_Call + 0xbc (0x5602b0585f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f84b399ff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5602b056c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #62: + 0x150582 (0x5602b0585582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #63: PyObject_Call + 0xbc (0x5602b0585f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: return forward_call(*args, **kwargs) [default4]:[rank36]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:[rank45]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5578e7ea0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: return forward_call(*args, **kwargs) [default1]:[rank41]: frame #30: + 0x150582 (0x56442ab96582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56442ab7b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank42]: frame #26: PyObject_Call + 0xbc (0x5626ef787f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #38: + 0x211239 (0x5578e7f63239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f84b39a0fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f84b3955371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #32: + 0x150582 (0x56442ab96582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5626ef76e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5578e7e8fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank47]: frame #14: + 0x5ae6fa5 (0x7fb7e44fcfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #15: + 0x5124446 (0x7fb7e3b3a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: sharded_logits = self.model( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: return forward_call(*args, **kwargs) [default2]:[rank42]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5626ef77ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #16: + 0x1acf4b8 (0x7fb7e04e54b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: sharded_logits = self.model( [default1]:[rank41]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56442ab7b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank42]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5626ef76c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5578e7e8b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5578e7e96a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #34: + 0x150582 (0x56442ab96582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #30: 
+ 0x150582 (0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank46]: return forward_call(*args, **kwargs) [default7]:[rank47]: frame #17: + 0x5aee004 (0x7fb7e4504004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #18: + 0x5af36b5 (0x7fb7e45096b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: sharded_logits = self.model( [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56442ab7b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5626ef76c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #19: + 0xd2631e (0x7fb7f70f331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank45]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5578e7e86c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5578e7e96a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank41]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56442ab82f50 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #32: + 0x150582 (0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #20: + 0x47def4 (0x7fb7f684aef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56442ab94c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5578e7e878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f84b3955371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f84b3955371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: frame #21: + 0x1445a6 (0x55fa99b705a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55fa99b69a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #38: + 0x211239 (0x56442ac57239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: frame #45: + 0x150582 (0x5578e7ea2582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #23: + 0x150866 (0x55fa99b7c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55fa99b65142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: return forward_call(*args, **kwargs) [default5]:[rank45]: frame #46: PyObject_Call + 0xbc (0x5578e7ea2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f84b3955371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank43]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank47]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55fa99b70a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5626ef76c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56442ab83a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank40]: frame #9: 
c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f847b162189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56442ab7f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56442ab8aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f847b169610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank38]: Traceback (most recent call last): [default5]:[rank45]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5578e7e892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #34: + 0x150582 (0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5626ef76c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f847b188978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl 
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank37]: Traceback (most recent call last): [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank39]: Traceback (most recent call last): [default6]:[rank38]: trainer.train(dataloader) [default2]:[rank34]: Traceback (most recent call last): [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: trainer.train(dataloader) [default1]:[rank41]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56442ab7ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56442ab8aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank47]: frame #26: PyObject_Call + 0xbc (0x55fa99b7cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank38]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank43]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank34]: trainer.train(dataloader) [default2]:[rank42]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5626ef773f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #48: + 0x150582 (0x5578e7ea2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: return forward_call(*args, **kwargs) [default7]:[rank47]: frame 
#27: _PyEval_EvalFrameDefault + 0x2d83 (0x55fa99b632b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: trainer.train(dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank45]: frame #49: PyObject_Call + 0xbc (0x5578e7ea2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56442ab7b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5626ef785c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank42]: frame #38: + 0x211239 (0x5626ef848239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank43]: return forward_call(*args, **kwargs) [default1]:[rank41]: frame #45: + 0x150582 (0x56442ab96582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #12: + 0x5adc309 (0x7f84b3947309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank42]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5626ef774a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank37]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank40]: frame #13: + 0x5ae6f10 (0x7f84b3951f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank45]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5578e7e892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: Traceback (most recent call last): [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank38]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank47]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55fa99b70a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank43]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank34]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank37]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank46]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank40]: frame #14: + 0x5ae6fa5 (0x7f84b3951fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank41]: frame #46: PyObject_Call + 0xbc (0x56442ab96f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55fa99b618fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank43]: pipeline_state.run_communication() [default0]:[rank40]: frame #15: + 0x5124446 (0x7f84b2f8f446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5578e7e96a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank42]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5626ef7703e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5626ef77ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5626ef76bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5578e7e8f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5578e7ea0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #16: + 0x1acf4b8 (0x7f84af93a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #17: + 0x5aee004 (0x7f84b3959004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56442ab7d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #30: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55fa99b618fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: trainer.train(dataloader) [default5]:[rank61]: Traceback (most recent call last): [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank57]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank34]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank39]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank38]: output = model(**micro_batch) [default0]:[rank40]: frame #18: + 0x5af36b5 (0x7f84b395e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #19: + 0xd2631e (0x7f84c654831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank61]: trainer.train(dataloader) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank61]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: return forward_call(*args, **kwargs) [default4]:[rank44]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank61]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: frame #48: + 0x150582 (0x56442ab96582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank37]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank47]: frame #32: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: recv_activation_tensor = recv_activation() [default1]:[rank57]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank39]: output = model(**micro_batch) [default2]:[rank34]: output = model(**micro_batch) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank57]: output = model(**micro_batch) [default5]:[rank37]: 
File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5626ef77ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #49: PyObject_Call + 0xbc (0x56442ab96f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56442ab7d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: frame #54: + 0x211239 (0x5578e7f63239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: frame #20: + 0x47def4 (0x7f84c5c9fef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank61]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank37]: output = model(**micro_batch) [default6]:[rank46]: pipeline_state.run_communication() [default5]:[rank61]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank61]: output = model(**micro_batch) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: frame #55: PyObject_Call + 0x207 (0x5578e7ea3067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5626ef76c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: sharded_logits = self.model( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank42]: frame #45: + 0x150582 (0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default5]:[rank61]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56442ab8aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: return forward_call(*args, **kwargs) [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56442ab83007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5578e7e892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return forward_call(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank42]: frame #46: PyObject_Call + 0xbc (0x5626ef787f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank61]: sharded_logits = self.model( [default2]:[rank34]: return forward_call(*args, **kwargs) [default0]:[rank40]: frame #21: + 0x1445a6 (0x562d2eccf5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank41]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56442ab94c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank44]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank57]: return forward_call(*args, **kwargs) [default6]:[rank38]: sharded_logits = self.model( [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55fa99b618fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank57]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank39]: return forward_call(*args, **kwargs) [default5]:[rank45]: frame #57: + 0x150582 (0x5578e7ea2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank37]: return forward_call(*args, **kwargs) [default6]:[rank46]: recv_activation_tensor = recv_activation() [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: sharded_logits = self.model( [default7]:[rank47]: frame #34: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55fa99b618fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5626ef76e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank61]: return forward_call(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in 
_wrapped_call_impl [default6]:[rank46]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank39]: sharded_logits = self.model( [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5578e7e878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: frame #59: + 0x150582 (0x5578e7ea2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #22: _PyObject_MakeTpCall + 0x26b (0x562d2ecc8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: frame #54: + 0x211239 (0x56442ac57239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank42]: frame #48: + 0x150582 
(0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank61]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: return forward_call(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: return forward_call(*args, **kwargs) [default0]:[rank40]: frame #23: + 0x150866 (0x562d2ecdb866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55fa99b68f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: frame #60: PyObject_Call + 0xbc (0x5578e7ea2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: frame #55: PyObject_Call + 0x207 
(0x56442ab97067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: pipeline_state.run_communication() [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank34]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank41]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56442ab7d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank45]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5578e7e892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank42]: frame #49: PyObject_Call + 0xbc (0x5626ef787f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5626ef76e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank43]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: frame #62: + 0x150582 (0x5578e7ea2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: recv_activation_tensor = recv_activation() [default7]:[rank39]: return forward_call(*args, **kwargs) [default6]:[rank46]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank61]: return forward_call(*args, **kwargs) [default2]:[rank34]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55fa99b7ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #38: + 0x211239 (0x55fa99c3d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank37]: sharded_logits = self.model( [default1]:[rank41]: frame #57: + 0x150582 (0x56442ab96582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5626ef77ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank38]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank46]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank46]: dist.recv( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank47]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55fa99b69a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank38]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank41]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56442ab7b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: return 
self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank43]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: frame #63: PyObject_Call + 0xbc (0x5578e7ea2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55fa99b653e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5626ef774007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #59: + 0x150582 (0x56442ab96582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank34]: return forward_call(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank45]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default0]:[rank40]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x562d2ecc4142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #25: _PyFunction_Vectorcall + 0x6c (0x562d2eccfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank37]: return forward_call(*args, **kwargs) [default0]:[rank40]: frame #26: PyObject_Call + 0xbc (0x562d2ecdbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #60: PyObject_Call + 0xbc (0x56442ab96f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56442ab7d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: pipeline_state.run_communication() [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank57]: dist.recv( [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank46]: return func(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank57]: return func(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states 
[default2]:[rank42]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5626ef785c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank57]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank34]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank40]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x562d2ecc22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank41]: frame #62: + 0x150582 (0x56442ab96582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank34]: pipeline_state.run_communication() [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank43]: dist.recv( [default1]:[rank57]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank37]: return self.forward_with_hidden_states(input_ids=input_ids, 
input_mask=input_mask)[0] [default6]:[rank46]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: frame #28: _PyFunction_Vectorcall + 0x6c (0x562d2eccfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x562d2ecc08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f42bb833897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank61]: recv_activation_tensor = recv_activation() [default6]:[rank38]: return forward_call(*args, **kwargs) [default2]:[rank42]: frame #54: + 0x211239 (0x5626ef848239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: return func(*args, **kwargs) [default5]:[rank61]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank42]: frame #55: PyObject_Call + 0x207 (0x5626ef788067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #1: + 0x5b3a23e 
(0x7f42f535023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank62]: Traceback (most recent call last): [default5]:[rank53]: Traceback (most recent call last): [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank53]: trainer.train(dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank34]: recv_activation_tensor = recv_activation() [default5]:[rank37]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank47]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55fa99b70a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55fa99b60c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f42f534ac87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55fa99b70a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f42f534af82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank44]: return forward_call(*args, **kwargs) [default6]:[rank62]: trainer.train(dataloader) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank41]: frame #63: PyObject_Call + 0xbc (0x56442ab96f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank39]: return forward_call(*args, **kwargs) [default1]:[rank41]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:[rank62]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank34]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank43]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank57]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f42f534bfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank40]: frame #30: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55fa99b618fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: return forward_call(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank38]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank44]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank46]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank53]: outputs = 
self.pipeline_engine.train_batch_iter( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank53]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank53]: output = model(**micro_batch) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default2]:[rank34]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank43]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank53]: sharded_logits = self.model( [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank38]: File
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank38]: pipeline_state.run_communication() [default3]:[rank43]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff024972897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank53]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank53]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank53]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank40]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x562d2ecc08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank59]: Traceback (most recent call last): [default5]:[rank61]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank43]: frame #1: + 0x5b3a23e (0x7ff05e48f23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank49]: Traceback (most recent call last): [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: frame #32: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f42f5300371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: trainer.train(dataloader) [default2]:[rank34]: dist.recv( [default2]:[rank42]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5626ef76e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank53]: return forward_call(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank39]: new_kwargs[name] = recv_from_pipeline_state_buffer( 
[default2]:[rank42]: frame #57: + 0x150582 (0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank37]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank46]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f621f990897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank58]: Traceback (most recent call last): [default5]:[rank53]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank38]: recv_activation_tensor = recv_activation() [default3]:[rank43]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7ff05e489c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f42f5300371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in 
recv_from_pipeline_state_buffer [default1]:[rank49]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank37]: pipeline_state.run_communication() [default4]:[rank44]: pipeline_state.run_communication() [default2]:[rank42]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5626ef76c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank53]: pipeline_state.run_communication() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank49]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank38]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank43]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff05e489f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #1: + 0x5b3a23e (0x7f62594ad23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: trainer.train(dataloader) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter 
[default2]:[rank34]: return func(*args, **kwargs) [default2]:[rank42]: frame #59: + 0x150582 (0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank49]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank43]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff05e48afd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f62594a7c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: recv_activation_tensor = recv_activation() [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank53]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank47]: 
frame #45: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: dist.recv( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank38]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank43]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff05e43f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f42f5300371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank37]: recv_activation_tensor = recv_activation() [default3]:[rank43]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff05e43f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank49]: output = model(**micro_batch) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank38]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank42]: frame #60: PyObject_Call + 0xbc 
(0x5626ef787f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x562d2ecc08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #34: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: output = model(**micro_batch) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank39]: pipeline_state.run_communication() [default0]:[rank40]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x562d2ecc08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank58]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank57]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f42f5300371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank34]: 
torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank46]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f62594a7f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank43]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff05e43f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x562d2ecc7f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank34]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank42]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5626ef76e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f62594a8fd1 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: dist.recv( [default6]:[rank38]: dist.recv( [default6]:[rank46]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f625945d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank61]: return func(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank39]: recv_activation_tensor = recv_activation() [default7]:[rank47]: frame #46: PyObject_Call + 0xbc (0x55fa99b7cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank37]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank42]: frame #62: + 0x150582 (0x5626ef787582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f42bcb0d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: sharded_logits = self.model( [default1]:[rank49]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank44]: recv_activation_tensor = recv_activation() [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f939e0be897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank47]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55fa99b632b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f42bcb14610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: return func(*args, **kwargs) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: frame #37: _PyObject_Call_Prepend + 0x69 (0x562d2ecd9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank62]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank39]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank34]: frame #1: + 0x5b3a23e (0x7f93d7bdb23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #63: PyObject_Call + 0xbc (0x5626ef787f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: trainer.train(dataloader) [default5]:[rank61]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank49]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank38]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank62]: return forward_call(*args, **kwargs) [default5]:[rank53]: return func(*args, **kwargs) [default2]:[rank34]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f93d7bd5c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank44]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank43]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff05e43f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank38]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank43]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff025c4c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #48: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank49]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank37]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank46]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 
(0x7f625945d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank53]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank38]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank47]: frame #49: PyObject_Call + 0xbc (0x55fa99b7cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: sharded_logits = self.model( [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank47]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55fa99b632b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank59]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank58]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank61]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb615544897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank53]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but 
store->get('0:1') got error: Connection reset by peer [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank43]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7ff025c53610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: frame #1: + 0x5b3a23e (0x7fb64f06123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank53]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank38]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5a47f4a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank42]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:[rank61]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fb64f05bc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank46]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f625945d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f42bcb33978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank53]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f17a6acd897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank39]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank40]: frame #38: + 0x211239 (0x562d2ed9c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #39: _PyObject_MakeTpCall + 0x26b (0x562d2ecc8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: frame #1: + 0x5b3a23e (0x7f5a81a6723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x562d2ecc43e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:[rank43]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7ff025c72978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb64f05bf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #1: + 0x5b3a23e (0x7f17e05ea23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank47]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55fa99b70a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: return forward_call(*args, **kwargs) [default2]:[rank34]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f93d7bd5f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #41: _PyFunction_Vectorcall + 0x6c (0x562d2eccfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return forward_call(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank57]: frame #12: + 0x5adc309 (0x7f42f52f2309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in 
forward [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x562d2ecbfc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #12: + 0x5adc309 (0x7ff05e431309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #13: + 0x5ae6f10 (0x7ff05e43bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: output = model(**micro_batch) [default1]:[rank49]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank38]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f5a81a61c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #43: _PyFunction_Vectorcall + 0x6c (0x562d2eccfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #13: + 0x5ae6f10 (0x7f42f52fcf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f17e05e4c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f93d7bd6fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x562d2ecc08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f17e05e4f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank37]: dist.recv( [default6]:[rank46]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f625945d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank53]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f17e05e5fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5a81a61f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93d7b8b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6220c6a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank59]: output = self.forward(context=context, 
state=state, micro_batch=micro_batch, model=model) [default1]:[rank49]: pipeline_state.run_communication() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank47]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55fa99b69007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: recv_activation_tensor = recv_activation() [default2]:[rank34]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93d7b8b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f6220c71610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank37]: return func(*args, **kwargs) [default6]:[rank38]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5a81a62fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f6220c90978 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank59]: output = model(**micro_batch) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17e059a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93d7b8b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55fa99b7ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank57]: frame #14: + 0x5ae6fa5 (0x7f42f52fcfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank40]: frame #45: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb64f05cfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank49]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: frame #46: PyObject_Call + 0xbc (0x562d2ecdbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #15: + 0x5124446 (0x7f42f493a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank53]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17e059a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f93d7b8b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #14: + 0x5ae6fa5 (0x7ff05e43bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #15: + 0x5124446 (0x7ff05da79446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb64f011371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17e059a371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5a81a17371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank47]: frame #54: + 0x211239 (0x55fa99c3d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank61]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb64f011371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank34]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f939f398189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x562d2ecc22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #55: PyObject_Call + 0x207 (0x55fa99b7d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb64f011371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: Traceback (most recent call last): [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank39]: dist.recv( [default5]:[rank37]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank46]: frame #12: + 0x5adc309 (0x7f625944f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank49]: dist.recv( [default4]:[rank52]: trainer.train(dataloader) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank52]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank34]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f939f39f610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #13: + 0x5ae6f10 (0x7f6259459f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank53]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17e059a371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank44]: dist.recv( [default7]:[rank55]: Traceback (most recent call last): [default2]:[rank34]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f939f3be978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55fa99b632b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank53]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f17a7da7189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank37]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank37]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5b77029897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank47]: frame #57: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: return func(*args, **kwargs) [default2]:[rank34]: frame #12: + 0x5adc309 (0x7f93d7b7d309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank39]: return func(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: frame #48: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank53]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f17a7dae610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank34]: frame #13: + 0x5ae6f10 (0x7f93d7b87f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: return func(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank53]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f17a7dcd978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank37]: frame #1: + 0x5b3a23e (0x7f5bb0b4623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #58: 
_PyEval_EvalFrameDefault + 0x13ca (0x55fa99b618fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: trainer.train(dataloader) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f5bb0b40c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #14: + 0x5ae6fa5 (0x7f6259459fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank53]: frame #12: + 0x5adc309 (0x7f17e058c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5bb0b40f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #15: + 0x5124446 (0x7f6258a97446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank53]: frame #13: + 0x5ae6f10 (0x7f17e0596f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank38]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5a81a17371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #49: PyObject_Call + 0xbc (0x562d2ecdbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x562d2ecc22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank61]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb64f011371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb61681e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank37]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5bb0b41fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank57]: frame #16: + 0x1acf4b8 (0x7f42f12e54b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank39]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank44]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 
(most recent call first): [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank37]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5bb0af6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #59: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe6b44ab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank34]: frame #14: + 0x5ae6fa5 (0x7f93d7b87fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #16: + 0x1acf4b8 (0x7f62554424b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fb616825610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: frame #1: + 0x5b3a23e (0x7fe6edfc823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank34]: frame #15: + 0x5124446 
(0x7f93d71c5446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f053b2c2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank59]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank37]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5bb0af6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #16: + 0x1acf4b8 (0x7ff05a4244b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank57]: frame #17: + 0x5aee004 (0x7f42f5304004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5a81a17371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank43]: frame #17: + 0x5aee004 (0x7ff05e443004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #60: PyObject_Call + 0xbc (0x55fa99b7cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: 
frame #51: _PyFunction_Vectorcall + 0x6c (0x562d2eccfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #1: + 0x5b3a23e (0x7f0574ddf23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #18: + 0x5af36b5 (0x7f42f53096b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #19: + 0xd2631e (0x7f4307ef331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank37]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5bb0af6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #16: + 0x1acf4b8 (0x7f93d3b704b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #18: + 0x5af36b5 (0x7ff05e4486b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fb616844978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank40]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x562d2ecc8007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #12: + 0x5adc309 (0x7fb64f003309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: Traceback (most recent call last): [default7]:[rank63]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: frame #20: + 0x47def4 (0x7f430764aef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #21: + 0x1445a6 (0x5565dabca5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #17: + 0x5aee004 (0x7f93d7b8f004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #17: + 0x5aee004 (0x7f6259461004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: sharded_logits = self.model( [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5565dabc3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5a81a17371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f164d8d3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank46]: frame #18: + 0x5af36b5 (0x7f62594666b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: trainer.train(dataloader) [default2]:[rank34]: frame #18: + 0x5af36b5 (0x7f93d7b946b5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0574dd9c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #13: + 0x5ae6f10 (0x7fb64f00df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: output = model(**micro_batch) [default5]:[rank37]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5bb0af6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #19: + 0xd2631e (0x7f626c05031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #23: + 0x150866 (0x5565dabd6866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank39]: frame #1: + 0x5b3a23e (0x7f16873f023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #19: + 0xd2631e (0x7f93ea77e31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank43]: frame #19: + 0xd2631e (0x7ff07103231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #20: + 0x47def4 (0x7f626b7a7ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5565dabbf142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default1]:[rank49]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fe6edfc2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #14: + 0x5ae6fa5 (0x7f17e0596fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f5b78303189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #21: + 0x1445a6 (0x560769b205a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank56]: Traceback (most recent call last): [default5]:[rank61]: frame #14: + 0x5ae6fa5 (0x7fb64f00dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #15: + 0x5124446 (0x7fb64e64b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe6edfc2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #20: + 0x47def4 (0x7f93e9ed5ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank47]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 
(0x55fa99b632b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: sharded_logits = self.model( [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5565dabcaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #16: + 0x1acf4b8 (0x7fb64aff64b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #17: + 0x5aee004 (0x7fb64f015004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #18: + 0x5af36b5 (0x7fb64f01a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe6edfc3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #15: + 0x5124446 (0x7f17dfbd4446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f5b7830a610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #53: _PyObject_Call_Prepend + 0x69 (0x562d2ecd9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #54: + 0x211239 (0x562d2ed9c239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default7]:[rank39]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f16873eac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #62: + 0x150582 (0x55fa99b7c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: trainer.train(dataloader) [default5]:[rank61]: frame #19: + 0xd2631e (0x7fb661c0431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: frame #16: + 0x1acf4b8 (0x7f17dc57f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #17: + 0x5aee004 (0x7f17e059e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f5b78329978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0574dd9f82 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank39]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f16873eaf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #55: PyObject_Call + 0x207 (0x562d2ecdc067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank56]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank55]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank34]: frame #21: + 0x1445a6 (0x55635db255a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #12: + 0x5adc309 (0x7f5bb0ae8309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x562d2ecc22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: return forward_call(*args, **kwargs) [default3]:[rank59]: return forward_call(*args, **kwargs) [default5]:[rank61]: frame #20: + 0x47def4 (0x7fb66135bef4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank61]: frame #21: + 0x1445a6 (0x560b1b97a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560b1b973a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #18: + 0x5af36b5 (0x7f17e05a36b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f16873ebfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0574ddafd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank53]: frame #19: + 0xd2631e (0x7f17f318d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank38]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f5a49224189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560769b19a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank55]: output = model(**micro_batch) [default5]:[rank37]: frame #13: + 
0x5ae6f10 (0x7f5bb0af2f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #20: + 0x47def4 (0x7ff070789ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank43]: frame #21: + 0x1445a6 (0x55b6377c15a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #23: + 0x150866 (0x560b1b986866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe6edf78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #20: + 0x47def4 (0x7f17f28e4ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank39]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f16873a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #23: + 0x150866 (0x560769b2c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank57]: frame #26: PyObject_Call + 0xbc (0x5565dabd6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #21: + 0x1445a6 (0x5629e09cc5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe6edf78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55635db1ea6b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #63: PyObject_Call + 0xbc (0x55fa99b7cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5565dabbd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank61]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560b1b96f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank37]: frame #14: + 0x5ae6fa5 (0x7f5bb0af2fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b6377baa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return forward_call(*args, **kwargs) [default4]:[rank52]: sharded_logits = self.model( [default7]:[rank39]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f16873a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #57: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank56]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) 
[default1]:[rank49]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe6edf78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe6edf78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #23: + 0x150866 (0x55635db31866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f5a4922b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0574d8f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank53]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5629e09c5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f16873a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0574d8f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0574d8f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: return self._call_impl(*args, 
**kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: return forward_call(*args, **kwargs) [default5]:[rank37]: frame #15: + 0x5124446 (0x7f5bb0130446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f5a4924a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x562d2ecc08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #23: + 0x150866 (0x5629e09d8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5629e09c1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe6b5785189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f16873a0371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560769b15142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560769b20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl 
[default4]:[rank52]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: frame #16: + 0x1acf4b8 (0x7f5bacadb4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #26: PyObject_Call + 0xbc (0x560769b2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #59: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #60: PyObject_Call + 0xbc (0x562d2ecdbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fe6b578c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fe6b57ab978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank38]: frame #12: + 0x5adc309 (0x7f5a81a09309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]:[rank52]: return forward_call(*args, **kwargs) [default7]:[rank39]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f164ebad189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560769b132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #23: + 0x150866 (0x55b6377cd866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #12: + 0x5adc309 (0x7fe6edf6a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55635db1a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0574d8f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank58]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank49]: frame #13: + 0x5ae6f10 (0x7fe6edf74f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f164ebb4610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560769b20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5565dabcaa2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank37]: frame #17: + 0x5aee004 (0x7f5bb0afa004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560769b118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5565dabbb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #18: + 0x5af36b5 (0x7f5bb0aff6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55635db25a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f053c59c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank58]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank39]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f164ebd3978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, 
bool) + 0xc50 (0x7f053c5a3610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]: output = model(**micro_batch) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank38]: frame #13: + 0x5ae6f10 (0x7f5a81a13f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b6377b6142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank34]: frame #26: PyObject_Call + 0xbc (0x55635db31f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b6377c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank34]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55635db182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x562d2ecc22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #30: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #12: + 0x5adc309 (0x7f1687392309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default5]:[rank37]: frame #19: + 0xd2631e (0x7f5bc36e931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank34]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55635db25a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #14: + 0x5ae6fa5 (0x7f5a81a13fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #62: + 0x150582 (0x562d2ecdb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: frame #13: + 0x5ae6f10 (0x7f168739cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #26: PyObject_Call + 0xbc (0x55b6377cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560b1b97aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55635db168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b6377b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #26: PyObject_Call + 0xbc (0x560b1b986f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank59]: hidden_encoder_states = encoder_block(**hidden_encoder_states) 
[default2]:[rank34]: frame #30: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #63: PyObject_Call + 0xbc (0x562d2ecdbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55635db168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #20: + 0x47def4 (0x7f5bc2e40ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank40]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank61]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560b1b96d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: frame #14: + 0x5ae6fa5 (0x7f168739cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #32: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f053c5c2978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank57]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5565dabbb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #21: + 0x1445a6 (0x563ccbf265a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #22: _PyObject_MakeTpCall + 0x26b (0x563ccbf1fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #12: + 0x5adc309 (0x7f0574d81309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank34]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55635db168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #34: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b6377c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b6377b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank39]: frame #15: + 0x5124446 (0x7f16869da446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55635db168fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #30: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #32: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55635db1df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #23: + 0x150866 (0x563ccbf32866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #30: + 0x150582 (0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #13: + 0x5ae6f10 (0x7f0574d8bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560b1b97aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560b1b96b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #15: + 0x5124446 (0x7f5a81051446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b6377b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #30: + 0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #16: + 0x1acf4b8 (0x7f16833854b8 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #32: + 0x150582 (0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: output = model(**micro_batch) [default1]:[rank57]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5565dabbb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563ccbf1b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b6377b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #34: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: return forward_call(*args, **kwargs) [default0]:[rank56]: sharded_logits = self.model( [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank38]: frame #16: + 0x1acf4b8 (0x7f5a7d9fc4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #34: + 0x150582 (0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560b1b96b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: sharded_logits = self.model( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank53]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5629e09cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #25: _PyFunction_Vectorcall + 
0x6c (0x563ccbf26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55635db2fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560769b118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank62]: pipeline_state.run_communication() [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default6]:[rank38]: frame #17: + 0x5aee004 (0x7f5a81a1b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #14: + 0x5ae6fa5 (0x7f0574d8bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank39]: frame #17: + 0x5aee004 (0x7f16873a4004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #32: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #26: PyObject_Call + 0xbc (0x5629e09d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #38: + 0x211239 (0x55635dbf2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #35: _PyEval_EvalFrameDefault + 0x13ca 
(0x55b6377b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #15: + 0x5124446 (0x7f05743c9446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5629e09bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #14: + 0x5ae6fa5 (0x7fe6edf74fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #18: + 0x5af36b5 (0x7f16873a96b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #18: + 0x5af36b5 (0x7f5a81a206b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b6377b9f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #15: + 0x5124446 (0x7fe6ed5b2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: return forward_call(*args, **kwargs) [default2]:[rank34]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55635db1ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #16: + 0x1acf4b8 (0x7f0570d744b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank34]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 
(0x55635db1a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #26: PyObject_Call + 0xbc (0x563ccbf32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #19: + 0xd2631e (0x7f5a9460a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank43]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b6377cbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5629e09cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5629e09bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #19: + 0xd2631e (0x7f1699f9331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank44]: frame #17: + 0x5aee004 (0x7f0574d93004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: frame #16: + 0x1acf4b8 (0x7fe6e9f5d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #20: + 0x47def4 (0x7f5a93d61ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank43]: frame #38: + 0x211239 (0x55b63788e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5565dabbb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank53]: frame #30: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5629e09bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #20: + 0x47def4 (0x7f16996eaef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank44]: frame #18: + 0x5af36b5 (0x7f0574d986b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560769b118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b6377baa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: frame #17: + 0x5aee004 (0x7fe6edf7c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #21: + 0x1445a6 (0x55796dd125a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #19: + 0xd2631e (0x7f058798231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5565dabc2f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank52]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55635db25a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b6377b63e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: pipeline_state.run_communication() [default5]:[rank53]: frame #32: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5629e09bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #21: + 0x1445a6 (0x5578422375a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #20: + 0x47def4 (0x7f05870d9ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank58]: return forward_call(*args, **kwargs) [default5]:[rank53]: frame #34: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank34]: frame #42: _PyEval_EvalFrameDefault + 0x72c 
(0x55635db15c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b6377c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5565dabd4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55635db25a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563ccbf192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #22: _PyObject_MakeTpCall + 0x26b (0x557842230a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #21: + 0x1445a6 (0x55bd46d555a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b6377b1c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #34: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #32: + 0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560b1b96b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #18: + 0x5af36b5 (0x7fe6edf816b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #19: + 0xd2631e (0x7fe700b6b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank38]: frame #22: _PyObject_MakeTpCall + 0x26b 
(0x55796dd0ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b6377c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55635db168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560769b118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank52]: return forward_call(*args, **kwargs) [default7]:[rank55]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank39]: frame #23: + 0x150866 (0x557842243866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b6377b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: recv_activation_tensor = recv_activation() [default5]:[rank53]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5629e09bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5629e09c4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #23: + 0x150866 (0x55796dd1e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #45: + 0x150582 
(0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return forward_call(*args, **kwargs) [default5]:[rank61]: frame #34: + 0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank38]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55796dd07142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #45: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #46: PyObject_Call + 0xbc (0x55635db31f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55bd46d4ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560b1b96b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #20: + 0x47def4 (0x7fe7002c2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank49]: frame #21: + 0x1445a6 (0x55a1dae565a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55796dd12a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #46: PyObject_Call + 0xbc (0x55b6377cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank62]: recv_activation_tensor = recv_activation() [default6]:[rank62]: 
File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank62]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55784222c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #23: + 0x150866 (0x55bd46d61866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560769b18f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b6377b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560b1b972f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: frame #25: _PyFunction_Vectorcall + 0x6c (0x557842237a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560769b2ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank57]: frame #38: + 0x211239 (0x5565dac97239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank53]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5629e09d6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55635db182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #48: + 0x150582 (0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #49: PyObject_Call + 0xbc (0x55b6377cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5565dabc3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #38: + 0x211239 (0x5629e0a99239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #26: PyObject_Call + 0xbc (0x55796dd1ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #28: _PyFunction_Vectorcall + 0x6c (0x563ccbf26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #38: + 0x211239 (0x560769bed239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank52]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank34]: frame #48: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560769b19a6b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b6377b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bd46d4a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55bd46d55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank49]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a1dae4fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #23: + 0x150866 (0x55a1dae62866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55796dd052b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b6377c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank49]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a1dae4b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: pipeline_state.run_communication() [default5]:[rank37]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563ccbf178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default4]:[rank44]: frame #26: PyObject_Call + 0xbc (0x55bd46d61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank37]: frame #30: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b6377ba007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560769b153e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank55]: return forward_call(*args, **kwargs) [default5]:[rank53]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5629e09c5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #49: PyObject_Call + 0xbc (0x55635db31f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b6377cbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank52]: recv_activation_tensor = recv_activation() [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank38]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55796dd12a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #54: + 0x211239 (0x55b63788e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank61]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560b1b984c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a1dae56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank39]: frame #26: PyObject_Call + 0xbc (0x557842243f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #55: PyObject_Call + 0x207 (0x55b6377ce067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560769b20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank62]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank53]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5629e09c13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5629e09cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55635db182b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b6377b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5629e09bcc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank52]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank38]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55796dd038fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560769b10c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #26: PyObject_Call + 0xbc (0x55a1dae62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5629e09cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #30: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563ccbf178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55784222a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #57: + 0x150582 (0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #27: 
_PyEval_EvalFrameDefault + 0x2d83 (0x55bd46d482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank34]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55635db25a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #32: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b6377b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank39]: frame #28: _PyFunction_Vectorcall + 0x6c (0x557842237a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55bd46d55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5629e09bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5578422288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560769b20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank55]: pipeline_state.run_communication() [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank52]: meta = self._recv_meta(from_rank=from_rank, 
tag=tag) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank37]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563ccbf178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #59: + 0x150582 (0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bd46d468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5565dabbf3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5565dabcaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55796dd038fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #60: PyObject_Call + 0xbc (0x55b6377cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560769b118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #30: + 0x150582 (0x55bd46d61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: sharded_logits = self.model( [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: frame #30: + 0x150582 (0x557842243582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #45: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: dist.recv( 
[default1]:[rank57]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5565dabbac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #38: + 0x211239 (0x560b1ba47239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank34]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55635db1e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bd46d468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #32: + 0x150582 (0x55bd46d61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55635db2fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #34: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b6377b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #46: PyObject_Call + 0xbc (0x560769b2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560769b132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: pipeline_state.run_communication() [default6]:[rank62]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank38]: frame #32: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bd46d468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #62: + 0x150582 (0x55b6377cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank38]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55796dd038fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5578422288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563ccbf178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #34: + 0x150582 (0x55bd46d61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return forward_call(*args, **kwargs) [default5]:[rank53]: frame #45: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #54: + 0x211239 (0x55635dbf2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #63: PyObject_Call + 0xbc (0x55b6377cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:[rank46]: frame #48: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560b1b973a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560b1b96f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return func(*args, **kwargs) [default5]:[rank53]: frame #46: PyObject_Call + 0xbc (0x5629e09d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #34: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank44]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bd46d468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #49: PyObject_Call + 0xbc (0x560769b2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5565dabcaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5565dabbb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: recv_activation_tensor = recv_activation() [default5]:[rank37]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563ccbf1ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bd46d4df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors 
[default7]:[rank63]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank34]: frame #55: PyObject_Call + 0x207 (0x55635db32067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55bd46d5fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #38: + 0x211239 (0x55bd46e22239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank59]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank55]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank53]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5629e09bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #48: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55635db182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #37: _PyObject_Call_Prepend + 0x69 (0x563ccbf30c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #32: + 0x150582 (0x557842243582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560769b132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:[rank46]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560769b20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank55]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank38]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55796dd038fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560769b19007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560769b2ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560b1b97aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #49: PyObject_Call + 0xbc (0x5629e09d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55796dd0af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #54: + 0x211239 (0x560769bed239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560b1b96ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5629e09bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #38: + 0x211239 (0x563ccbff3239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55bd46d4ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank38]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55796dd1cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bd46d4a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #55: PyObject_Call + 0x207 (0x560769b2d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank55]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank34]: frame #57: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55bd46d55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bd46d45c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55bd46d55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560769b132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:[rank62]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank58]: recv_activation_tensor = recv_activation() [default5]:[rank53]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5629e09cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5629e09c5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5578422288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bd46d468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank53]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5629e09d6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank38]: frame #38: + 0x211239 (0x55796dddf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55635db168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #45: + 0x150582 (0x55bd46d61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank55]: dist.recv( [default7]:[rank39]: frame #34: + 0x150582 (0x557842243582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #57: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560769b118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #59: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #60: PyObject_Call + 0xbc (0x560769b2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: dist.recv( [default5]:[rank61]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560b1b97aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #45: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: dist.recv( [default5]:[rank37]: frame #39: _PyObject_MakeTpCall + 0x26b (0x563ccbf1fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560769b132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank53]: frame #54: + 0x211239 (0x5629e0a99239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1dae492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #59: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #46: PyObject_Call + 0xbc (0x55bd46d61f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank37]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563ccbf1b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #62: + 0x150582 (0x560769b2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #46: PyObject_Call + 0xbc (0x5565dabd6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: return func(*args, **kwargs) [default5]:[rank37]: frame #41: _PyFunction_Vectorcall + 0x6c (0x563ccbf26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5578422288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd46d482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a1dae56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a1dae478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55796dd0ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #48: + 0x150582 (0x55bd46d61582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #49: PyObject_Call + 0xbc (0x55bd46d61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: return func(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank38]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55796dd073e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #60: PyObject_Call + 0xbc (0x55635db31f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55784222ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd46d482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #30: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a1dae478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563ccbf16c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #63: PyObject_Call + 0xbc (0x560769b2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank38]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55796dd12a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default4]:[rank44]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55bd46d55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5565dabbd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank34]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55635db182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank62]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank61]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560b1b96b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank55]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank55]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank37]: frame #43: _PyFunction_Vectorcall + 0x6c (0x563ccbf26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55796dd02c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bd46d4e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #53: 
_PyObject_Call_Prepend + 0x69 (0x55bd46d5fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #54: + 0x211239 (0x55bd46e22239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #45: + 0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank49]: frame #32: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a1dae478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #62: + 0x150582 (0x55635db31582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #37: _PyObject_Call_Prepend + 0x69 (0x557842241c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #55: PyObject_Call + 0x207 (0x55bd46d62067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd46d482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #57: + 0x150582 (0x55bd46d61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: return func(*args, **kwargs) [default7]:[rank55]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa46cd36897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank37]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563ccbf178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bd46d468fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #59: + 0x150582 (0x55bd46d61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #60: PyObject_Call + 0xbc (0x55bd46d61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank61]: frame #46: PyObject_Call + 0xbc (0x560b1b986f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #55: PyObject_Call + 0x207 (0x5629e09d9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #63: PyObject_Call + 0xbc (0x55635db31f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd46d482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #62: + 0x150582 (0x55bd46d61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #63: PyObject_Call + 0xbc (0x55bd46d61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank52]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2dd7b77897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank55]: frame #1: + 0x5b3a23e (0x7fa4a685323e in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55796dd12a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank57]: frame #48: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5629e09bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #57: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank57]: frame #49: PyObject_Call + 0xbc (0x5565dabd6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5629e09bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa4a684dc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa4a684df82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55796dd038fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank55]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa4a684efd1 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #45: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank53]: frame #59: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #38: + 0x211239 (0x557842304239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #45: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560b1b96d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank49]: frame #34: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #46: PyObject_Call + 0xbc (0x563ccbf32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank52]: frame #1: + 0x5b3a23e (0x7f2e1169423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #46: PyObject_Call + 0xbc (0x55796dd1ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: pipeline_state.run_communication() [default5]:[rank53]: frame #60: PyObject_Call + 0xbc (0x5629e09d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default5]:[rank37]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563ccbf192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #39: _PyObject_MakeTpCall + 0x26b (0x557842230a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank55]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4a6803371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55796dd052b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5565dabbd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5565dabcaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5629e09bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #48: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55784222c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank53]: frame #62: + 0x150582 (0x5629e09d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #49: PyObject_Call + 0xbc (0x55796dd1ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: recv_activation_tensor = recv_activation() [default5]:[rank61]: frame #48: + 
0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #63: PyObject_Call + 0xbc (0x5629e09d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4a6803371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #41: _PyFunction_Vectorcall + 0x6c (0x557842237a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #49: PyObject_Call + 0xbc (0x560b1b986f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd16d694897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank55]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4a6803371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f2e1168ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55796dd052b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #48: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #1: + 0x5b3a23e (0x7fd1a71b123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a1dae478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:[rank38]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55796dd12a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560b1b96d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4a6803371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank37]: frame #49: PyObject_Call + 0xbc (0x563ccbf32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x557842227c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55796dd0b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fd1a71abc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #43: _PyFunction_Vectorcall + 0x6c (0x557842237a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55796dd1cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd1a71abf82 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #54: + 0x211239 (0x55796dddf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563ccbf192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5578422288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd1a71acfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #55: PyObject_Call + 0x207 (0x55796dd1f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560b1b97aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank39]: frame #45: + 0x150582 (0x557842243582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: return forward_call(*args, **kwargs) [default7]:[rank55]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa46e010189 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2e1168ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2e1168ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55796dd052b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #57: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5565dabc3007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a1dae4ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #46: PyObject_Call + 0xbc (0x557842243f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #51: _PyFunction_Vectorcall + 0x6c (0x563ccbf26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5565dabd4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd1a7161371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a1dae60c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55796dd038fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank55]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa46e017610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank38]: frame #59: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank52]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e11644371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55784222a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank52]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e11644371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #38: + 0x211239 (0x55a1daf23239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a1dae4fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563ccbf1f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #60: PyObject_Call + 0xbc (0x55796dd1ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame 
#61: _PyEval_EvalFrameDefault + 0x2d83 (0x55796dd052b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #62: + 0x150582 (0x55796dd1e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #63: PyObject_Call + 0xbc (0x55796dd1ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd1a7161371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e11644371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa46e036978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a1dae4b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]:[rank39]: frame #48: + 0x150582 (0x557842243582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #53: _PyObject_Call_Prepend + 0x69 (0x563ccbf30c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a1dae56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #49: PyObject_Call + 0xbc (0x557842243f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2e11644371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #12: + 0x5adc309 (0x7fa4a67f5309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #13: + 0x5ae6f10 (0x7fa4a67fff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55784222a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2dd8e51189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: frame #51: _PyFunction_Vectorcall + 0x6c (0x557842237a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f2dd8e58610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: frame #52: 
_PyObject_FastCallDictTstate + 0x187 (0x557842230007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a1dae46c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a1dae56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #54: + 0x211239 (0x563ccbff3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #55: PyObject_Call + 0x207 (0x563ccbf33067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank63]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank49]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a1dae478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #14: + 0x5ae6fa5 (0x7fa4a67fffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f2dd8e77978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: frame #53: _PyObject_Call_Prepend + 0x69 (0x557842241c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank52]: frame #12: + 0x5adc309 (0x7f2e11636309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #54: + 0x211239 (0x557842304239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default7]:[rank39]: frame #55: PyObject_Call + 0x207 (0x557842244067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank62]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd1a7161371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #15: + 0x5124446 (0x7fa4a5e3d446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55784222a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #57: + 0x150582 (0x557842243582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a7223f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank49]: frame #45: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #16: + 0x1acf4b8 (0x7fa4a27e84b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5578422288fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #59: + 0x150582 (0x557842243582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #54: + 0x211239 (0x5565dac97239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #55: PyObject_Call + 0x207 (0x5565dabd7067 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #17: + 0x5aee004 (0x7fa4a6807004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #60: PyObject_Call + 0xbc (0x557842243f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55784222a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #62: + 0x150582 (0x557842243582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5565dabbd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #46: PyObject_Call + 0xbc (0x55a1dae62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #63: PyObject_Call + 0xbc (0x557842243f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default2]:[rank58]: dist.recv( [default4]:[rank52]: frame #13: + 0x5ae6f10 (0x7f2e11640f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563ccbf192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #57: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563ccbf178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #59: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #1: + 0x5b3a23e (0x7f4aabd5c23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #18: + 0x5af36b5 (0x7fa4a680c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #60: PyObject_Call + 0xbc (0x563ccbf32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563ccbf192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #62: + 0x150582 (0x563ccbf32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank63]: pipeline_state.run_communication() [default7]:[rank55]: frame #19: + 0xd2631e (0x7fa4b93f631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank37]: frame #63: PyObject_Call + 0xbc 
(0x563ccbf32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: return func(*args, **kwargs) [default7]:[rank55]: frame #20: + 0x47def4 (0x7fa4b8b4def4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank62]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd1a7161371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd16e96e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: Traceback (most recent call last): [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank55]: frame #21: + 0x1445a6 (0x560aabbb65a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f4aabd56c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560aabbafa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: Traceback (most recent call last): [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank49]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1dae492b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: frame #48: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: trainer.train(dataloader) [default0]:[rank48]: trainer.train(dataloader) [default5]:[rank61]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560b1b973007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #57: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #23: + 0x150866 (0x560aabbc2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5565dabbb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank52]: frame #14: + 0x5ae6fa5 (0x7f2e11640fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #15: + 0x5124446 (0x7f2e10c7e446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank54]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank61]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560b1b984c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560aabbab142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #16: + 0x1acf4b8 (0x7f2e0d6294b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #59: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #17: + 0x5aee004 (0x7f2e11648004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank58]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank56]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank52]: frame #18: + 0x5af36b5 (0x7f2e1164d6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank56]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank55]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560aabbb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank59]: frame #3: 
c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4aabd56f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #60: PyObject_Call + 0xbc (0x5565dabd6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fd16e975610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fd16e994978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank49]: frame #49: PyObject_Call + 0xbc (0x55a1dae62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #26: PyObject_Call + 0xbc (0x560aabbc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4aabd57fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank56]: dist.recv( [default1]:[rank57]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5565dabbd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #54: + 0x211239 (0x560b1ba47239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #55: PyObject_Call + 0x207 (0x560b1b987067 in
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1dae492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank57]: frame #62: + 0x150582 (0x5565dabd6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #63: PyObject_Call + 0xbc (0x5565dabd6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank55]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560aabba92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560aabbb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560b1b96d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank56]: return func(*args, **kwargs) [default7]:[rank63]: recv_activation_tensor = recv_activation() [default4]:[rank52]: frame #19: + 0xd2631e (0x7f2e2423731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank55]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560aabba78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #12: + 0x5adc309 (0x7fd1a7153309 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #13: + 0x5ae6f10 (0x7fd1a715df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1916ee1897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank48]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank49]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a1dae56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank55]: frame #30: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank52]: frame #20: + 0x47def4 (0x7f2e2398eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank59]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4aabd0c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank63]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank55]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560aabba78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #32: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #21: + 0x1445a6 (0x559f8ea555a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #1: + 0x5b3a23e (0x7f19509fe23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank56]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank61]: frame #57: + 0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: output = model(**micro_batch) [default3]:[rank59]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4aabd0c371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #22: _PyObject_MakeTpCall + 0x26b (0x559f8ea4ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #23: + 0x150866 (0x559f8ea61866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4aabd0c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f19509f8c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559f8ea4a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f19509f8f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank63]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank52]: frame #25: _PyFunction_Vectorcall + 0x6c (0x559f8ea55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #14: + 0x5ae6fa5 (0x7fd1a715dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank49]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a1dae4f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a1dae60c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f19509f9fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f19509ae371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f19509ae371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #54: + 0x211239 (0x55a1daf23239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #26: PyObject_Call + 0xbc (0x559f8ea61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560b1b96b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: frame #59: + 0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank63]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank49]: frame #55: PyObject_Call + 0x207 (0x55a1dae63067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4aabd0c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560aabba78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #34: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4a73519189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank58]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f19509ae371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1dae492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #60: PyObject_Call + 0xbc (0x560b1b986f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #57: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #15: + 0x5124446 (0x7fd1a679b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560aabba78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: 
dist.recv( [default6]:[rank62]: frame #16: + 0x1acf4b8 (0x7fd1a31464b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560aabbaef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559f8ea482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #17: + 0x5aee004 (0x7fd1a7165004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #18: + 0x5af36b5 (0x7fd1a716a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f19509ae371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: return forward_call(*args, **kwargs) [default2]:[rank58]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f19181bb189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank59]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4a73520610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: return func(*args, **kwargs) 
[default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank56]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank56]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank61]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560b1b96d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank63]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank59]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f4a7353f978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5b312c4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank56]: frame #1: + 0x5b3a23e (0x7f5b6ade123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #28: _PyFunction_Vectorcall + 0x6c (0x559f8ea55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank62]: frame #19: + 0xd2631e (0x7fd1b9d5431e in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank52]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559f8ea468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #62: + 0x150582 (0x560b1b986582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank62]: frame #20: + 0x47def4 (0x7fd1b94abef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank62]: frame #21: + 0x1445a6 (0x555cf1a4f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #30: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559f8ea468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555cf1a48a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #63: PyObject_Call + 0xbc (0x560b1b986f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: output = model(**micro_batch) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f5b6addbc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5b6addbf82 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560aabbc0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5b6addcfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdf90954897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank63]: frame #1: + 0x5b3a23e (0x7fdfca47123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #23: + 0x150866 (0x555cf1a5b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a1dae478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #59: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]:[rank59]: frame #12: + 0x5adc309 (0x7f4aabcfe309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #32: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #13: + 0x5ae6f10 (0x7f4aabd08f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559f8ea468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555cf1a44142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555cf1a4fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #60: PyObject_Call + 0xbc (0x55a1dae62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #26: PyObject_Call + 0xbc (0x555cf1a5bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf1a422b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #38: + 0x211239 (0x560aabc83239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560aabbafa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a1dae492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #34: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #35: _PyEval_EvalFrameDefault + 0x13ca 
(0x559f8ea468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560aabbab3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank54]: sharded_logits = self.model( [default7]:[rank55]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560aabbb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555cf1a4fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560aabba6c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b6ad91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: frame #14: + 0x5ae6fa5 (0x7f4aabd08fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #15: + 0x5124446 (0x7f4aab346446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #16: + 0x1acf4b8 (0x7f4aa7cf14b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559f8ea4df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560aabbb6a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560aabba78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f19181c2610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #37: _PyObject_Call_Prepend + 0x69 (0x559f8ea5fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555cf1a408fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #30: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return forward_call(*args, **kwargs) [default3]:[rank59]: frame #17: + 0x5aee004 (0x7f4aabd10004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #18: + 0x5af36b5 (0x7f4aabd156b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #45: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #46: PyObject_Call + 0xbc (0x560aabbc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555cf1a408fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fdfca46bc87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #38: + 0x211239 (0x559f8eb22239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #39: _PyObject_MakeTpCall + 0x26b (0x559f8ea4ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fdfca46bf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: sharded_logits = self.model( [default2]:[rank58]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f19181e1978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: frame #32: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fdfca46cfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560aabba92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #48: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #12: + 0x5adc309 (0x7f19509a0309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default6]:[rank62]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555cf1a408fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #34: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #62: + 0x150582 (0x55a1dae62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #63: PyObject_Call + 0xbc (0x55a1dae62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #13: + 0x5ae6f10 (0x7f19509aaf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank55]: frame #49: PyObject_Call + 0xbc (0x560aabbc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560aabba92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfca421371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555cf1a408fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555cf1a47f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555cf1a59c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #19: + 0xd2631e (0x7f4abe8ff31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #14: + 
0x5ae6fa5 (0x7f19509aafa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #15: + 0x5124446 (0x7f194ffe8446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b6ad91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b6ad91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #16: + 0x1acf4b8 (0x7f194c9934b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #17: + 0x5aee004 (0x7f19509b2004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #18: + 0x5af36b5 (0x7f19509b76b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #38: + 0x211239 (0x555cf1b1c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555cf1a48a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfca421371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5b6ad91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #9: 
c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f5b3259e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559f8ea4a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: frame #41: _PyFunction_Vectorcall + 0x6c (0x559f8ea55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank62]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555cf1a443e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555cf1a4fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559f8ea45c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #20: + 0x47def4 (0x7f4abe056ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #19: + 0xd2631e (0x7f19635a131e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #20: + 0x47def4 (0x7f1962cf8ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank52]: frame #43: _PyFunction_Vectorcall + 0x6c (0x559f8ea55a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #21: + 0x1445a6 (0x5567a49675a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559f8ea468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5567a4960a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank58]: frame #21: + 0x1445a6 (0x5635aaaae5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #23: + 0x150866 (0x5567a4973866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5567a495c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560aabbb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560aabbaf007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555cf1a3fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfca421371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank62]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555cf1a4fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:[rank59]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5567a4967a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank59]: frame #26: PyObject_Call + 0xbc (0x5567a4973f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdfca421371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #45: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #46: PyObject_Call + 0xbc (0x559f8ea61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5635aaaa7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5567a495a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank63]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fdf91c2e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fdf91c35610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fdf91c54978 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f5b325a5610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f5b325c4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559f8ea482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #23: + 0x150866 (0x5635aaaba866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5635aaaa3142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #48: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5635aaaaea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #12: + 0x5adc309 (0x7f5b6ad83309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #13: + 0x5ae6f10 (0x7f5b6ad8df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: frame #49: PyObject_Call + 0xbc 
(0x559f8ea61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559f8ea482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #51: _PyFunction_Vectorcall + 0x6c (0x559f8ea55a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return forward_call(*args, **kwargs) [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5567a4967a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5567a49588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank63]: frame #12: + 0x5adc309 (0x7fdfca413309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #13: + 0x5ae6f10 (0x7fdfca41df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560aabbc0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #14: + 0x5ae6fa5 (0x7f5b6ad8dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #54: + 0x211239 (0x560aabc83239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #15: + 0x5124446 (0x7f5b6a3cb446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #26: PyObject_Call + 0xbc (0x5635aaabaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank63]: frame #14: + 0x5ae6fa5 (0x7fdfca41dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #15: + 0x5124446 (0x7fdfc9a5b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank58]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5635aaaa12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #16: + 0x1acf4b8 (0x7f5b66d764b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank56]: frame #17: + 0x5aee004 (0x7f5b6ad95004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #16: + 0x1acf4b8 (0x7fdfc64064b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #17: + 0x5aee004 (0x7fdfca425004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559f8ea4e007 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #18: + 0x5af36b5 (0x7f5b6ad9a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #30: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #18: + 0x5af36b5 (0x7fdfca42a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #53: _PyObject_Call_Prepend + 0x69 (0x559f8ea5fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #54: + 0x211239 (0x559f8eb22239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: frame #55: PyObject_Call + 0x207 (0x559f8ea62067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559f8ea482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #57: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default2]:[rank58]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5635aaaaea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5635aaa9f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #55: PyObject_Call + 0x207 (0x560aabbc3067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560aabba92b3 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5567a49588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #19: + 0xd2631e (0x7fdfdd01431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank63]: frame #20: + 0x47def4 (0x7fdfdc76bef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank62]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555cf1a408fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #19: + 0xd2631e (0x7f5b7d98431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #30: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #32: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #45: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #21: + 0x1445a6 (0x55a50852e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a508527a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #57: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: frame #20: + 0x47def4 (0x7f5b7d0dbef4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank56]: frame #21: + 0x1445a6 (0x55b0ba17c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: return forward_call(*args, **kwargs) [default0]:[rank48]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank62]: frame #46: PyObject_Call + 0xbc (0x555cf1a5bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559f8ea468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560aabba78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf1a422b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank62]: frame #48: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #49: PyObject_Call + 0xbc (0x555cf1a5bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #59: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #59: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #60: PyObject_Call + 0xbc (0x560aabbc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #60: PyObject_Call + 0xbc (0x559f8ea61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: 
pipeline_state.run_communication() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank55]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560aabba92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #62: + 0x150582 (0x560aabbc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559f8ea482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank48]: recv_activation_tensor = recv_activation() [default7]:[rank63]: frame #23: + 0x150866 (0x55a50853a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a508523142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #62: + 0x150582 (0x559f8ea61582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #63: PyObject_Call + 0xbc (0x559f8ea61f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a50852ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf1a422b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank62]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555cf1a4fa2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #63: PyObject_Call + 0xbc (0x560aabbc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #26: PyObject_Call + 0xbc (0x55a50853af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b0ba175a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #23: + 0x150866 (0x55b0ba188866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank56]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b0ba171142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank59]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5567a49588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #34: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank62]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555cf1a48007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5635aaa9f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]:[rank59]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5567a49588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank63]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a5085212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a50852ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank59]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5567a495ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5567a4971c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #32: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a50851f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank54]: pipeline_state.run_communication() [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank48]: dist.recv( [default0]:[rank48]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank48]: return func(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank48]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank54]: recv_activation_tensor = recv_activation() [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank63]: frame #30: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank63]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a50851f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank56]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b0ba17ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank48]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank58]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5635aaa9f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default2]:[rank58]: frame #34: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank63]: frame #32: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a50851f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feabaec8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank63]: frame #34: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank58]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5635aaa9f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #1: + 0x5b3a23e (0x7feaf49e523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5635aaaa6f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank62]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555cf1a59c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7feaf49dfc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #3: 
c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7feaf49dff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #38: + 0x211239 (0x5567a4a34239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7feaf49e0fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #26: PyObject_Call + 0xbc (0x55b0ba188f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7feaf4995371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #54: + 0x211239 (0x555cf1b1c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank62]: frame #55: PyObject_Call + 0x207 (0x555cf1a5c067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5567a4960a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5567a495c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7feaf4995371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf1a422b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: dist.recv( [default0]:[rank48]: 
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7feaf4995371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank54]: return func(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank48]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7feaf4995371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank48]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7feabc1a2189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank58]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5635aaab8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5567a4967a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5567a4957c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7feabc1a9610 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ba16f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank59]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5567a4967a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7feabc1c8978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank59]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5567a49588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #57: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8445f61897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank58]: frame #38: + 0x211239 (0x5635aab7b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a50851f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #12: + 0x5adc309 (0x7feaf4987309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a508526f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #13: + 0x5ae6f10 (0x7feaf4991f10 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #1: + 0x5b3a23e (0x7f847fa7e23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a508538c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f847fa78c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #45: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #46: PyObject_Call + 0xbc (0x5567a4973f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f847fa78f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f847fa79fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555cf1a408fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f847fa2e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f847fa2e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b0ba17ca2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f847fa2e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ba16d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #14: + 0x5ae6fa5 (0x7feaf4991fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5567a495a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #48: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f847fa2e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #30: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #15: + 0x5124446 (0x7feaf3fcf446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5635aaaa7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #59: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #49: PyObject_Call + 0xbc (0x5567a4973f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5567a495a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #16: + 
0x1acf4b8 (0x7feaf097a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ba16d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #17: + 0x5aee004 (0x7feaf4999004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #18: + 0x5af36b5 (0x7feaf499e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5635aaaa33e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5635aaaaea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5635aaa9ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #19: + 0xd2631e (0x7feb0758831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank48]: frame #20: + 0x47def4 (0x7feb06cdfef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank54]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f844723b189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #21: + 0x1445a6 (0x55c2575025a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c2574fba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #23: 
+ 0x150866 (0x55c25750e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8447242610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c2574f7142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c257502a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #60: PyObject_Call + 0xbc (0x555cf1a5bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #26: PyObject_Call + 0xbc (0x55c25750ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #32: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c2574f52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c257502a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5567a4967a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5567a4960007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f8447261978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank59]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5567a4971c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ba16d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #12: + 0x5adc309 (0x7f847fa20309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c2574f38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #30: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #34: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c2574f38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5635aaaaea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #13: + 0x5ae6f10 (0x7f847fa2af10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #32: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c2574f38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5635aaa9f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555cf1a422b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #34: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:[rank59]: frame #54: + 0x211239 (0x5567a4a34239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c2574f38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c2574faf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ba16d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c25750cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #14: + 0x5ae6fa5 (0x7f847fa2afa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #15: + 0x5124446 (0x7f847f068446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #38: + 0x211239 (0x55a5085fb239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a508527a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #16: + 0x1acf4b8 (0x7f847ba134b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #45: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #46: PyObject_Call + 0xbc (0x5635aaabaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #17: + 0x5aee004 (0x7f847fa32004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame 
#36: _PyObject_FastCallDictTstate + 0xd0 (0x55b0ba174f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #18: + 0x5af36b5 (0x7f847fa376b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #19: + 0xd2631e (0x7f849262131e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank59]: frame #55: PyObject_Call + 0x207 (0x5567a4974067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #38: + 0x211239 (0x55c2575cf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5567a495a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #62: + 0x150582 (0x555cf1a5b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c2574fba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c2574f73e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #63: PyObject_Call + 0xbc (0x555cf1a5bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b0ba186c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c257502a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #38: + 0x211239 (0x55b0ba249239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c2574f2c5c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #20: + 0x47def4 (0x7f8491d78ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5635aaaa12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c257502a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #48: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #21: + 0x1445a6 (0x559f35c1d5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b0ba175a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c2574f38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #45: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #46: PyObject_Call + 0xbc (0x55c25750ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c2574f52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #48: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #22: _PyObject_MakeTpCall + 0x26b (0x559f35c16a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #49: PyObject_Call + 0xbc (0x55c25750ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #23: + 0x150866 
(0x559f35c29866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559f35c12142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c2574f52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a5085233e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a50852ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c257502a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #25: _PyFunction_Vectorcall + 0x6c (0x559f35c1da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b0ba1713e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c2574fb007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c25750cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #49: PyObject_Call + 0xbc (0x5635aaabaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5635aaaa12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #54: + 0x211239 (0x55c2575cf239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #26: PyObject_Call + 0xbc (0x559f35c29f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank54]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559f35c102b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b0ba17ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #28: _PyFunction_Vectorcall + 0x6c (0x559f35c1da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5635aaaaea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5635aaaa7007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559f35c0e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a50851ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #55: PyObject_Call + 0x207 (0x55c25750f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a50852ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c2574f52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5635aaab8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #54: + 0x211239 (0x5635aab7b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #30: + 
0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559f35c0e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b0ba16cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: Traceback (most recent call last): [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank48]: frame #57: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c2574f38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b0ba17ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: trainer.train(dataloader) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank48]: frame #59: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #60: PyObject_Call + 0xbc (0x55c25750ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #57: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5567a49588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c2574f52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank50]: 
Traceback (most recent call last): [default3]:[rank59]: frame #59: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank59]: frame #60: PyObject_Call + 0xbc (0x5567a4973f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ba16d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank51]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank56]: frame #45: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: trainer.train(dataloader) [default2]:[rank58]: frame #55: PyObject_Call + 0x207 (0x5635aaabb067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5635aaaa12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #32: + 0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559f35c0e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a50851f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #45: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank50]: outputs, loss_avg = 
self.training_step(dataloader=self.current_dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank58]: frame #57: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5635aaa9f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #62: + 0x150582 (0x55c25750e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #46: PyObject_Call + 0xbc (0x55b0ba188f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank63]: frame #46: PyObject_Call + 0xbc (0x55a50853af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a5085212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: frame #48: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #63: PyObject_Call + 0xbc (0x55c25750ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #59: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #60: PyObject_Call + 0xbc (0x5635aaabaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank51]: output = model(**micro_batch) [default2]:[rank50]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank54]: frame #34: + 0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x559f35c0e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559f35c15f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #49: PyObject_Call + 0xbc (0x55a50853af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #37: _PyObject_Call_Prepend + 0x69 (0x559f35c27c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a5085212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a50852ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #38: + 0x211239 (0x559f35cea239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:[rank56]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ba16f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #39: _PyObject_MakeTpCall + 0x26b (0x559f35c16a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5567a495a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #62: + 0x150582 (0x5567a4973582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: return forward_call(*args, **kwargs) [default0]:[rank56]: frame #48: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a508527007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a508538c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #49: PyObject_Call + 0xbc (0x55b0ba188f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ba16f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559f35c123e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #41: _PyFunction_Vectorcall + 0x6c (0x559f35c1da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b0ba17ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #54: + 0x211239 
(0x55a5085fb239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: sharded_logits = self.model( [default7]:[rank63]: frame #55: PyObject_Call + 0x207 (0x55a50853b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559f35c0dc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b0ba175007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #43: _PyFunction_Vectorcall + 0x6c (0x559f35c1da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559f35c0e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a5085212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #57: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default7]:[rank63]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a50851f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5635aaaa12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #62: + 0x150582 (0x5635aaaba582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #45: + 0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #53: 
_PyObject_Call_Prepend + 0x69 (0x55b0ba186c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank63]: frame #59: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #60: PyObject_Call + 0xbc (0x55a50853af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: frame #46: PyObject_Call + 0xbc (0x559f35c29f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559f35c102b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #63: PyObject_Call + 0xbc (0x5635aaabaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #48: + 0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]:[rank59]: frame #63: PyObject_Call + 0xbc (0x5567a4973f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: output = model(**micro_batch) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: frame #54: + 0x211239 (0x55b0ba249239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default3]:[rank59]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank63]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a5085212b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #49: PyObject_Call + 0xbc (0x559f35c29f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #62: + 0x150582 (0x55a50853a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: return forward_call(*args, **kwargs) [default7]:[rank63]: frame #63: PyObject_Call + 0xbc (0x55a50853af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank63]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:[rank54]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559f35c102b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #51: _PyFunction_Vectorcall + 0x6c (0x559f35c1da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank50]: sharded_logits = self.model( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default3]:[rank51]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank56]: frame #55: PyObject_Call + 0x207 (0x55b0ba189067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ba16f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559f35c16007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #57: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b0ba16d8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #59: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: frame #60: PyObject_Call + 0xbc (0x55b0ba188f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b0ba16f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #62: + 0x150582 (0x55b0ba188582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #63: PyObject_Call + 0xbc (0x55b0ba188f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank54]: frame #53: _PyObject_Call_Prepend + 0x69 (0x559f35c27c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #54: + 0x211239 (0x559f35cea239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: . This may indicate a possible application crash on rank 0 or a network set up issue. 
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank50]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default6]:[rank54]: frame #55: PyObject_Call + 0x207 (0x559f35c2a067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559f35c102b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank54]: frame #57: + 0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank54]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559f35c0e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #59: + 0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #60: PyObject_Call + 0xbc (0x559f35c29f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank51]: pipeline_state.run_communication() [default6]:[rank54]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559f35c102b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #62: + 0x150582 (0x559f35c29582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #63: PyObject_Call + 0xbc (0x559f35c29f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank51]: recv_activation_tensor = recv_activation() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank50]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank50]: pipeline_state.run_communication() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank50]: recv_activation_tensor = recv_activation() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank50]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank50]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank50]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank50]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank50]: dist.recv( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank51]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank51]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank50]: return func(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank50]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank51]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank50]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d 
key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank51]: dist.recv( [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank50]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank51]: return func(*args, **kwargs) [default2]:[rank50]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4dd8a87897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank50]: frame #1: + 0x5b3a23e (0x7f4e125a423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank51]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank51]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank51]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6571952897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank51]: frame #1: + 0x5b3a23e (0x7f65ab46f23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f65ab469c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f4e1259ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f65ab469f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f65ab46afd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f65ab41f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f65ab41f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f65ab41f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4e1259ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f65ab41f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, 
int) + 0xa9 (0x7f6572c2c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f6572c33610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4e1259ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e12554371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f6572c52978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #12: + 0x5adc309 (0x7f65ab411309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #13: + 0x5ae6f10 (0x7f65ab41bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e12554371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e12554371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4e12554371 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #14: + 0x5ae6fa5 (0x7f65ab41bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #15: + 0x5124446 (0x7f65aaa59446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #16: + 0x1acf4b8 (0x7f65a74044b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #17: + 0x5aee004 (0x7f65ab423004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #18: + 0x5af36b5 (0x7f65ab4286b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #19: + 0xd2631e (0x7f65be01231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank50]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4dd9d61189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #20: + 0x47def4 (0x7f65bd769ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank50]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4dd9d68610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f4dd9d87978 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #21: + 0x1445a6 (0x55f8ca2f05a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #12: + 0x5adc309 (0x7f4e12546309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55f8ca2e9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #13: + 0x5ae6f10 (0x7f4e12550f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #23: + 0x150866 (0x55f8ca2fc866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #14: + 0x5ae6fa5 (0x7f4e12550fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #15: + 0x5124446 (0x7f4e11b8e446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f8ca2e5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55f8ca2f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #26: PyObject_Call + 0xbc (0x55f8ca2fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #16: + 0x1acf4b8 (0x7f4e0e5394b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f8ca2e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #28: 
_PyFunction_Vectorcall + 0x6c (0x55f8ca2f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f8ca2e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #30: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f8ca2e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #17: + 0x5aee004 (0x7f4e12558004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #32: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f8ca2e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #18: + 0x5af36b5 (0x7f4e1255d6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #19: + 0xd2631e (0x7f4e2514731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank51]: frame #34: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #20: + 0x47def4 (0x7f4e2489eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank51]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f8ca2e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #21: + 0x1445a6 (0x55bcb06995a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #22: _PyObject_MakeTpCall + 0x26b 
(0x55bcb0692a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #23: + 0x150866 (0x55bcb06a5866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bcb068e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f8ca2e8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55bcb0699a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #26: PyObject_Call + 0xbc (0x55bcb06a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bcb068c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55f8ca2fac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55bcb0699a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #38: + 0x211239 (0x55f8ca3bd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bcb068a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55f8ca2e9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #30: + 0x150582 (0x55bcb06a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f8ca2e53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: 
frame #41: _PyFunction_Vectorcall + 0x6c (0x55f8ca2f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f8ca2e0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55f8ca2f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f8ca2e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #45: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #46: PyObject_Call + 0xbc (0x55f8ca2fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f8ca2e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #48: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bcb068a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #49: PyObject_Call + 0xbc (0x55f8ca2fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f8ca2e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55f8ca2f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #32: + 0x150582 (0x55bcb06a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bcb068a8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #34: + 0x150582 (0x55bcb06a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f8ca2e9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bcb068a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55f8ca2fac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #54: + 0x211239 (0x55f8ca3bd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #55: PyObject_Call + 0x207 (0x55f8ca2fd067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f8ca2e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bcb0691f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #57: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f8ca2e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #59: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #60: PyObject_Call + 0xbc (0x55f8ca2fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55bcb06a3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #38: + 0x211239 (0x55bcb0766239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55bcb0692a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f8ca2e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #62: + 0x150582 (0x55f8ca2fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #63: PyObject_Call + 0xbc (0x55f8ca2fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank50]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bcb068e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55bcb0699a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bcb0689c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55bcb0699a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bcb068a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #45: + 0x150582 (0x55bcb06a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #46: PyObject_Call + 0xbc (0x55bcb06a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bcb068c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #48: + 0x150582 (0x55bcb06a5582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #49: PyObject_Call + 0xbc (0x55bcb06a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bcb068c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55bcb0699a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bcb0692007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55bcb06a3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #54: + 0x211239 (0x55bcb0766239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #55: PyObject_Call + 0x207 (0x55bcb06a6067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bcb068c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #57: + 0x150582 (0x55bcb06a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bcb068a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #59: + 0x150582 (0x55bcb06a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #60: PyObject_Call + 0xbc (0x55bcb06a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bcb068c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #62: + 0x150582 
(0x55bcb06a5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #63: PyObject_Call + 0xbc (0x55bcb06a5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank60]: Traceback (most recent call last): [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank60]: trainer.train(dataloader) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank60]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank60]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank60]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank60]: output = model(**micro_batch) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank60]: sharded_logits = self.model( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank60]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank60]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank60]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank60]: pipeline_state.run_communication() 
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank60]: recv_activation_tensor = recv_activation() [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank60]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank60]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank60]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank60]: dist.recv( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank60]: return func(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank60]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank60]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank60]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank60]: frame #0: 
c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efc4a058897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank60]: frame #1: + 0x5b3a23e (0x7efc83b7523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7efc83b6fc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7efc83b6ff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7efc83b70fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efc83b25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efc83b25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efc83b25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efc83b25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7efc4b332189 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7efc4b339610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7efc4b358978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: frame #12: + 0x5adc309 (0x7efc83b17309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #13: + 0x5ae6f10 (0x7efc83b21f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #14: + 0x5ae6fa5 (0x7efc83b21fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #15: + 0x5124446 (0x7efc8315f446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #16: + 0x1acf4b8 (0x7efc7fb0a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #17: + 0x5aee004 (0x7efc83b29004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #18: + 0x5af36b5 (0x7efc83b2e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #19: + 0xd2631e (0x7efc9671831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank60]: frame #20: + 
0x47def4 (0x7efc95e6fef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank60]: frame #21: + 0x1445a6 (0x55876d20a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55876d203a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #23: + 0x150866 (0x55876d216866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55876d1ff142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55876d20aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #26: PyObject_Call + 0xbc (0x55876d216f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55876d1fd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55876d20aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55876d1fb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #30: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55876d1fb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #32: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55876d1fb8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #34: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55876d1fb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55876d202f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55876d214c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #38: + 0x211239 (0x55876d2d7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55876d203a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55876d1ff3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55876d20aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55876d1fac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55876d20aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55876d1fb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #45: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #46: PyObject_Call + 0xbc (0x55876d216f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #47: 
_PyEval_EvalFrameDefault + 0x2d83 (0x55876d1fd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #48: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #49: PyObject_Call + 0xbc (0x55876d216f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55876d1fd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55876d20aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55876d203007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55876d214c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #54: + 0x211239 (0x55876d2d7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #55: PyObject_Call + 0x207 (0x55876d217067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55876d1fd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #57: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55876d1fb8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #59: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #60: PyObject_Call + 0xbc (0x55876d216f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: 
frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55876d1fd2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #62: + 0x150582 (0x55876d216582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #63: PyObject_Call + 0xbc (0x55876d216f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: . This may indicate a possible application crash on rank 0 or a network set up issue. W0703 09:43:22.751000 140380479002432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 567141 closing signal SIGTERM E0703 09:43:23.193000 140380479002432 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 567140) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ 
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_09:43:22 host : ip-26-0-169-139.ec2.internal rank : 2 (local_rank: 2) exitcode : 1 (pid: 567142) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_09:43:22 host : ip-26-0-169-139.ec2.internal rank : 3 (local_rank: 3) exitcode : 1 (pid: 567143) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_09:43:22 host : ip-26-0-169-139.ec2.internal rank : 4 (local_rank: 4) exitcode : 1 (pid: 567144) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_09:43:22 host : ip-26-0-169-139.ec2.internal rank : 5 (local_rank: 5) exitcode : 1 (pid: 567145) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_09:43:22 host : ip-26-0-169-139.ec2.internal rank : 6 (local_rank: 6) exitcode : 1 (pid: 567146) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_09:43:22 host : ip-26-0-169-139.ec2.internal rank : 7 (local_rank: 7) exitcode : 1 (pid: 567147) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_09:43:22 host : ip-26-0-169-139.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 567140) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-169-139: task 0: Exited with exit code 1 W0703 09:43:26.817000 140534699820800 
torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-170-31.ec2.internal_3096567_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.057000 139786108692224 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-56.ec2.internal_3427805_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.287000 139836089911040 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-247.ec2.internal_36615_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.305000 139681971750656 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-239.ec2.internal_2556242_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.430000 140054893676288 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-62.ec2.internal_3976586_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.479000 139998861629184 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-207.ec2.internal_2584734_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.733000 139788562155264 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-88.ec2.internal_964669_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. 
W0703 09:43:27.771000 140004522362688 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2584817 closing signal SIGTERM W0703 09:43:27.771000 140004522362688 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2584818 closing signal SIGTERM W0703 09:43:27.772000 140004522362688 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2584819 closing signal SIGTERM W0703 09:43:27.772000 140004522362688 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2584820 closing signal SIGTERM W0703 09:43:27.772000 140004522362688 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2584821 closing signal SIGTERM W0703 09:43:27.772000 140004522362688 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2584822 closing signal SIGTERM W0703 09:43:27.771000 139791769425728 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427879 closing signal SIGTERM W0703 09:43:27.771000 139791769425728 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427880 closing signal SIGTERM W0703 09:43:27.771000 139791769425728 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427881 closing signal SIGTERM W0703 09:43:27.771000 139687632484160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2556318 closing signal SIGTERM W0703 09:43:27.771000 139687632484160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2556320 closing signal SIGTERM W0703 09:43:27.771000 139687632484160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2556321 closing signal SIGTERM W0703 09:43:27.772000 139687632484160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2556322 closing signal SIGTERM W0703 09:43:27.772000 139687632484160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2556324 closing signal SIGTERM W0703 09:43:27.773000 139791769425728 
torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427882 closing signal SIGTERM W0703 09:43:27.773000 139791769425728 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427883 closing signal SIGTERM W0703 09:43:27.774000 139791769425728 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427884 closing signal SIGTERM W0703 09:43:27.774000 139791769425728 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427885 closing signal SIGTERM W0703 09:43:27.774000 139791769425728 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3427886 closing signal SIGTERM W0703 09:43:27.773000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096641 closing signal SIGTERM W0703 09:43:27.773000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096642 closing signal SIGTERM W0703 09:43:27.773000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096643 closing signal SIGTERM W0703 09:43:27.774000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096644 closing signal SIGTERM W0703 09:43:27.774000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096645 closing signal SIGTERM W0703 09:43:27.776000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096646 closing signal SIGTERM W0703 09:43:27.776000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096647 closing signal SIGTERM W0703 09:43:27.777000 140540360554304 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3096648 closing signal SIGTERM W0703 09:43:27.778000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3976663 closing signal SIGTERM W0703 09:43:27.779000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] 
Sending process 3976664 closing signal SIGTERM W0703 09:43:27.779000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3976665 closing signal SIGTERM W0703 09:43:27.780000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3976666 closing signal SIGTERM W0703 09:43:27.780000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3976667 closing signal SIGTERM W0703 09:43:27.781000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3976668 closing signal SIGTERM W0703 09:43:27.782000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3976669 closing signal SIGTERM W0703 09:43:27.784000 140060554409792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3976670 closing signal SIGTERM E0703 09:43:27.807000 139794222888768 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 964745) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 09:43:27.814000 139794222888768 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_964669_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.840000 139794222888768 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_964669_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.868000 139794222888768 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_964669_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_09:43:27 host : ip-26-0-171-88.ec2.internal rank : 57 (local_rank: 1) exitcode : 1 (pid: 964746) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_09:43:27 host : ip-26-0-171-88.ec2.internal rank : 58 (local_rank: 2) exitcode : 1 (pid: 964747) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_09:43:27 host : ip-26-0-171-88.ec2.internal rank : 59 (local_rank: 3) exitcode : 1 (pid: 964748) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_09:43:27 host : 
ip-26-0-171-88.ec2.internal rank : 60 (local_rank: 4) exitcode : 1 (pid: 964749) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_09:43:27 host : ip-26-0-171-88.ec2.internal rank : 61 (local_rank: 5) exitcode : 1 (pid: 964750) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_09:43:27 host : ip-26-0-171-88.ec2.internal rank : 62 (local_rank: 6) exitcode : 1 (pid: 964751) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-03_09:43:27 host : ip-26-0-171-88.ec2.internal rank : 63 (local_rank: 7) exitcode : 1 (pid: 964752) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_09:43:27 host : ip-26-0-171-88.ec2.internal rank : 56 (local_rank: 0) exitcode : 1 (pid: 964745) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ E0703 09:43:27.900000 139841750644544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 36688) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 09:43:27.906000 139841750644544 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-247.ec2.internal_36615_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:27.941000 139841750644544 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-247.ec2.internal_36615_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
W0703 09:43:27.975000 139841750644544 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-247.ec2.internal_36615_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_09:43:27 host : ip-26-0-169-247.ec2.internal rank : 25 (local_rank: 1) exitcode : 1 (pid: 36689) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_09:43:27 host : ip-26-0-169-247.ec2.internal rank : 26 (local_rank: 2) exitcode : 1 (pid: 36690) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_09:43:27 
host : ip-26-0-169-247.ec2.internal rank : 27 (local_rank: 3) exitcode : 1 (pid: 36691) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_09:43:27 host : ip-26-0-169-247.ec2.internal rank : 28 (local_rank: 4) exitcode : 1 (pid: 36692) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_09:43:27 host : ip-26-0-169-247.ec2.internal rank : 29 (local_rank: 5) exitcode : 1 (pid: 36693) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_09:43:27 host : ip-26-0-169-247.ec2.internal rank : 30 (local_rank: 6) exitcode : 1 (pid: 36694) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-03_09:43:27 host : ip-26-0-169-247.ec2.internal rank : 31 (local_rank: 7) exitcode : 1 (pid: 36695) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_09:43:27 host : ip-26-0-169-247.ec2.internal rank : 24 (local_rank: 0) exitcode : 1 (pid: 36688) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-171-88: task 7: Exited with exit code 1 srun: error: ip-26-0-169-247: task 3: Exited with exit code 1 E0703 09:43:29.201000 139687632484160 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2556317) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 09:43:29.207000 139687632484160 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-239.ec2.internal_2556242_0' has failed to shutdown the rendezvous 'none' due 
to an error of type RendezvousConnectionError. W0703 09:43:29.235000 139687632484160 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-239.ec2.internal_2556242_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:29.249000 139687632484160 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-239.ec2.internal_2556242_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_09:43:27 host : ip-26-0-169-239.ec2.internal rank : 18 (local_rank: 2) exitcode : 1 (pid: 2556319) error_file: traceback : To enable traceback see: 
https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_09:43:27 host : ip-26-0-169-239.ec2.internal rank : 22 (local_rank: 6) exitcode : 1 (pid: 2556323) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_09:43:27 host : ip-26-0-169-239.ec2.internal rank : 16 (local_rank: 0) exitcode : 1 (pid: 2556317) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ E0703 09:43:29.296000 140004522362688 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2584816) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 09:43:29.302000 140004522362688 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-207.ec2.internal_2584734_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:29.330000 140004522362688 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-207.ec2.internal_2584734_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:29.345000 140004522362688 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-207.ec2.internal_2584734_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_09:43:27 host : ip-26-0-169-207.ec2.internal rank : 15 (local_rank: 7) exitcode : 1 (pid: 2584823) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_09:43:27 host : ip-26-0-169-207.ec2.internal rank : 8 (local_rank: 0) exitcode : 1 (pid: 2584816) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-169-207: task 1: Exited with exit code 1 srun: error: ip-26-0-169-239: task 2: Exited with exit 
code 1 W0703 09:43:31.821000 140534699820800 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-170-31.ec2.internal_3096567_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:31.906000 140540360554304 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-170-31.ec2.internal_3096567_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:31.917000 140540360554304 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-170-31.ec2.internal_3096567_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. 
W0703 09:43:32.061000 139786108692224 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-56.ec2.internal_3427805_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. srun: error: ip-26-0-170-31: task 4: Exited with exit code 1 W0703 09:43:32.435000 140054893676288 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-62.ec2.internal_3976586_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:33.512000 139791769425728 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-56.ec2.internal_3427805_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:33.513000 140060554409792 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3976586_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 09:43:33.522000 139791769425728 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-56.ec2.internal_3427805_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent W0703 09:43:33.524000 140060554409792 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3976586_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    result = agent.run()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    result = self._invoke_run(role)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    self._state_holder.sync()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    get_response = self._backend.get_state()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    base64_state: bytes = self._call_store("get", self._key)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
    result = agent.run()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting
    self._state_holder.sync()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync
    get_response = self._backend.get_state()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-171-56: task 5: Exited with exit code 1
srun: error: ip-26-0-171-62: task 6: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.