======================== START TIME: Wed Jul 3 01:08:51 UTC 2024 python3 version = Python 3.10.14 ======================== The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. Token is valid (permission: write). Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token Login successful Already on 'bench_cluster' M examples/config_tiny_llama.py M examples/config_tiny_llama.yaml M examples/train_tiny_llama.sh M src/nanotron/models/llama.py M src/nanotron/trainer.py Your branch is up to date with 'origin/bench_cluster'. Job status: RUNNING W0703 01:08:54.235000 140623575447360 torch/distributed/run.py:757] W0703 01:08:54.235000 140623575447360 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.235000 140623575447360 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.235000 140623575447360 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.234000 140376923293504 torch/distributed/run.py:757] W0703 01:08:54.234000 140376923293504 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.234000 140376923293504 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.234000 140376923293504 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 140007806928704 torch/distributed/run.py:757] W0703 01:08:54.237000 140007806928704 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 140007806928704 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.237000 140007806928704 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 139729402169152 torch/distributed/run.py:757] W0703 01:08:54.237000 139729402169152 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 139729402169152 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.237000 139729402169152 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 140357703345984 torch/distributed/run.py:757] W0703 01:08:54.237000 140357703345984 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 140357703345984 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.237000 140357703345984 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 139842836084544 torch/distributed/run.py:757] W0703 01:08:54.237000 139842836084544 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.237000 139842836084544 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.237000 139842836084544 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.240000 140138204567360 torch/distributed/run.py:757] W0703 01:08:54.240000 140138204567360 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.240000 140138204567360 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.240000 140138204567360 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.245000 140511105955648 torch/distributed/run.py:757] W0703 01:08:54.245000 140511105955648 torch/distributed/run.py:757] ***************************************** W0703 01:08:54.245000 140511105955648 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 01:08:54.245000 140511105955648 torch/distributed/run.py:757] ***************************************** [default0]:07/03/2024 01:09:13 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Vocab Size Padding] Padded vocab (size: 50257) with 1 dummy tokens (new size: 50258) [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Config: [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: run='%date_%jobid', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: seed=42, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: step=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: consumed_train_samples=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: benchmark_csv_path=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: ignore_sanity_checks=True), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: parallelism=ParallelismArgs(dp=16, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pp=2, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp=2, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pp_engine=, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp_mode=, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tp_linear_async_communication=False, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: expert_parallel_size=1), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: eos_token_id=2, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_act='silu', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_size=2048, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: initializer_range=0.02, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: intermediate_size=4096, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: is_llama_config=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: max_position_embeddings=4096, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_attention_heads=32, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_hidden_layers=24, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_key_value_heads=32, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pad_token_id=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pretraining_tp=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rms_norm_eps=1e-05, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_scaling=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_theta=10000.0, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tie_word_embeddings=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: use_cache=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: vocab_size=50258), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: init_method=RandomInit(std=0.025), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dtype=torch.bfloat16, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: make_vocab_size_divisible_by=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: ddp_bucket_cap_mb=25), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer_revision=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokenizer_max_length=None), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoint_interval=100000, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: save_initial_state=False, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: resume_checkpoint_path=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: checkpoints_path_is_shared_file_system=False), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: logging=LoggingArgs(log_level='info', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: log_level_replica='info', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: iteration_step_info_interval=1), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tokens=TokensArgs(sequence_length=4096, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: train_steps=20, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: micro_batch_size=32, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: batch_accumulation_per_replica=2, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: val_check_interval=-1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: limit_val_batches=0, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: limit_test_batches=0), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: adam_beta1=0.9, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: adam_beta2=0.95, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: torch_adam_is_fused=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: name='adamW'), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: zero_stage=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: weight_decay=0.01, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: clip_grad=1.0, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: accumulate_grad_in_fp32=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_warmup_steps=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_warmup_style='linear', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_style='linear', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_steps=19, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lr_decay_starting_step=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: min_decay_lr=1e-05)), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: start_training_step=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hf_dataset_splits='train', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hf_dataset_config_name=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dataset_processing_num_proc_per_process=64, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: dataset_overwrite_cache=False, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: text_column_name='text'), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: seed=42, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_loading_workers=0))], [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-16_tp-2_pp-2_mbz-32')), [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: lighteval=None) [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Model Config: [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: LlamaConfig(bos_token_id=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: eos_token_id=2, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_act='silu', [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: hidden_size=2048, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: initializer_range=0.02, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: intermediate_size=4096, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: is_llama_config=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: max_position_embeddings=4096, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_attention_heads=32, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_hidden_layers=24, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: num_key_value_heads=32, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pad_token_id=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: pretraining_tp=1, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rms_norm_eps=1e-05, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_scaling=None, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: rope_theta=10000.0, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: tie_word_embeddings=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: use_cache=True, [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: vocab_size=50258) [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Building model.. [default0]:07/03/2024 01:09:13 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Setting PP block ranks... [default1]:07/03/2024 01:09:25 [INFO|DP=4|PP=0|TP=1|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=4|PP=0|TP=0|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=8|PP=0|TP=0|ip-26-0-161-138]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=4|PP=1|TP=0|ip-26-0-171-102]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Total number of parameters: 1.21G (2313.02MiB) [default0]:07/03/2024 01:09:25 [INFO|DP=12|PP=0|TP=0|ip-26-0-161-153]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Local number of parameters: 345M (658.27MiB) [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [After model building] Memory usage: 672.29MiB. Peak allocated: 674.32MiB Peak reserved: 690.00MiB [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Parametrizing model parameters using StandardParametrizator [default1]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-225]: Local number of parameters: 345M (658.27MiB) [default1]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-225]: [After model building] Memory usage: 672.29MiB. Peak allocated: 674.32MiB Peak reserved: 690.00MiB [default1]:07/03/2024 01:09:25 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-225]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=2|PP=0|TP=1|ip-26-0-160-225]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=1|PP=1|TP=1|ip-26-0-161-78]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=2|PP=1|TP=1|ip-26-0-161-78]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-78]: Local number of parameters: 261M (498.24MiB) [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-78]: [After model building] Memory usage: 508.26MiB. Peak allocated: 510.29MiB Peak reserved: 526.00MiB [default0]:07/03/2024 01:09:25 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-78]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=12|PP=1|TP=0|ip-26-0-171-88]: No checkpoint path provided. [default1]:07/03/2024 01:09:25 [INFO|DP=12|PP=1|TP=1|ip-26-0-171-88]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=1|PP=1|TP=0|ip-26-0-161-78]: No checkpoint path provided. [default1]:07/03/2024 01:09:25 [INFO|DP=8|PP=0|TP=1|ip-26-0-161-138]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=1|PP=0|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default1]:07/03/2024 01:09:25 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-78]: Local number of parameters: 261M (498.24MiB) [default1]:07/03/2024 01:09:25 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-78]: [After model building] Memory usage: 508.26MiB. Peak allocated: 510.29MiB Peak reserved: 526.00MiB [default1]:07/03/2024 01:09:25 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-78]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=2|PP=1|TP=0|ip-26-0-161-78]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=1|PP=0|TP=1|ip-26-0-160-225]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=2|PP=0|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default1]:07/03/2024 01:09:25 [INFO|DP=4|PP=1|TP=1|ip-26-0-171-102]: No checkpoint path provided. [default1]:07/03/2024 01:09:25 [INFO|DP=12|PP=0|TP=1|ip-26-0-161-153]: No checkpoint path provided. [default1]:07/03/2024 01:09:25 [INFO|DP=8|PP=1|TP=1|ip-26-0-171-62]: No checkpoint path provided. [default0]:07/03/2024 01:09:25 [INFO|DP=8|PP=1|TP=0|ip-26-0-171-62]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=14|PP=1|TP=1|ip-26-0-171-88]: No checkpoint path provided. [default7]:07/03/2024 01:09:25 [INFO|DP=3|PP=0|TP=1|ip-26-0-160-225]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=14|PP=0|TP=1|ip-26-0-161-153]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=13|PP=1|TP=1|ip-26-0-171-88]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=13|PP=1|TP=0|ip-26-0-171-88]: No checkpoint path provided. [default6]:07/03/2024 01:09:25 [INFO|DP=3|PP=0|TP=0|ip-26-0-160-225]: No checkpoint path provided. [default7]:07/03/2024 01:09:25 [INFO|DP=3|PP=1|TP=1|ip-26-0-161-78]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=14|PP=1|TP=0|ip-26-0-171-88]: No checkpoint path provided. [default6]:07/03/2024 01:09:25 [INFO|DP=3|PP=1|TP=0|ip-26-0-161-78]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=14|PP=0|TP=0|ip-26-0-161-153]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=13|PP=0|TP=0|ip-26-0-161-153]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=13|PP=0|TP=1|ip-26-0-161-153]: No checkpoint path provided. [default6]:07/03/2024 01:09:25 [INFO|DP=11|PP=0|TP=0|ip-26-0-161-138]: No checkpoint path provided. [default6]:07/03/2024 01:09:25 [INFO|DP=15|PP=1|TP=0|ip-26-0-171-88]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=9|PP=0|TP=1|ip-26-0-161-138]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=9|PP=0|TP=0|ip-26-0-161-138]: No checkpoint path provided. [default7]:07/03/2024 01:09:25 [INFO|DP=11|PP=0|TP=1|ip-26-0-161-138]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=10|PP=0|TP=0|ip-26-0-161-138]: No checkpoint path provided. [default7]:07/03/2024 01:09:25 [INFO|DP=15|PP=1|TP=1|ip-26-0-171-88]: No checkpoint path provided. [default6]:07/03/2024 01:09:25 [INFO|DP=15|PP=0|TP=0|ip-26-0-161-153]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=10|PP=0|TP=1|ip-26-0-161-138]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=9|PP=1|TP=1|ip-26-0-171-62]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=10|PP=1|TP=0|ip-26-0-171-62]: No checkpoint path provided. [default6]:07/03/2024 01:09:25 [INFO|DP=11|PP=1|TP=0|ip-26-0-171-62]: No checkpoint path provided. [default7]:07/03/2024 01:09:25 [INFO|DP=11|PP=1|TP=1|ip-26-0-171-62]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=10|PP=1|TP=1|ip-26-0-171-62]: No checkpoint path provided. [default7]:07/03/2024 01:09:25 [INFO|DP=15|PP=0|TP=1|ip-26-0-161-153]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=9|PP=1|TP=0|ip-26-0-171-62]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=5|PP=0|TP=1|ip-26-0-161-103]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=6|PP=1|TP=0|ip-26-0-171-102]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=5|PP=0|TP=0|ip-26-0-161-103]: No checkpoint path provided. [default2]:07/03/2024 01:09:25 [INFO|DP=5|PP=1|TP=0|ip-26-0-171-102]: No checkpoint path provided. [default4]:07/03/2024 01:09:25 [INFO|DP=6|PP=0|TP=0|ip-26-0-161-103]: No checkpoint path provided. [default3]:07/03/2024 01:09:25 [INFO|DP=5|PP=1|TP=1|ip-26-0-171-102]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=6|PP=1|TP=1|ip-26-0-171-102]: No checkpoint path provided. [default5]:07/03/2024 01:09:25 [INFO|DP=6|PP=0|TP=1|ip-26-0-161-103]: No checkpoint path provided. [default6]:07/03/2024 01:09:26 [INFO|DP=7|PP=1|TP=0|ip-26-0-171-102]: No checkpoint path provided. [default7]:07/03/2024 01:09:26 [INFO|DP=7|PP=1|TP=1|ip-26-0-171-102]: No checkpoint path provided. [default6]:07/03/2024 01:09:26 [INFO|DP=7|PP=0|TP=0|ip-26-0-161-103]: No checkpoint path provided. [default7]:07/03/2024 01:09:26 [INFO|DP=7|PP=0|TP=1|ip-26-0-161-103]: No checkpoint path provided. [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 0 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 1 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 2 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 3 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 4 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 5 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 6 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 7 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 8 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 9 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 10 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 11 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 12 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 13 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 14 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [ZeRO sharding] DP Rank 15 has 21.6M out of 345M (6.25%) params' optimizer states [default0]:07/03/2024 01:09:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/03/2024 01:09:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Using `datasets` library [default0]:07/03/2024 01:09:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/03/2024 01:09:35 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 01:09:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Training Plan] There are 1 training stages [default0]:07/03/2024 01:09:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Stage Training Stage] start from step 1 [default0]:07/03/2024 01:09:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [default0]:07/03/2024 01:09:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: [Start training] datetime: 2024-07-03 01:09:35.816105 | mbs: 32 | grad_accum: 2 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/03/2024 01:09:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/03/2024 01:09:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-225]: Memory usage: 2071.11MiB. Peak allocated 2071.11MiB. Peak reserved: 2092.00MiB [default7]:07/03/2024 01:09:35 [WARNING|DP=3|PP=0|TP=1|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:35 [WARNING|DP=2|PP=0|TP=1|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:35 [WARNING|DP=0|PP=0|TP=1|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:35 [WARNING|DP=13|PP=1|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:35 [WARNING|DP=1|PP=1|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:35 [WARNING|DP=2|PP=1|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:35 [WARNING|DP=0|PP=1|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:35 [WARNING|DP=1|PP=0|TP=1|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:35 [WARNING|DP=13|PP=0|TP=0|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 01:09:35 [WARNING|DP=8|PP=1|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:35 [WARNING|DP=10|PP=1|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:35 [WARNING|DP=11|PP=0|TP=0|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 01:09:35 [WARNING|DP=8|PP=0|TP=0|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 01:09:35 [WARNING|DP=6|PP=1|TP=0|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 01:09:35 [WARNING|DP=4|PP=1|TP=0|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:36 [WARNING|DP=4|PP=0|TP=1|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 01:09:35 [WARNING|DP=4|PP=0|TP=0|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:36 [WARNING|DP=9|PP=0|TP=0|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:35 [WARNING|DP=9|PP=0|TP=1|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:35 [WARNING|DP=14|PP=1|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:36 [WARNING|DP=15|PP=1|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:36 [WARNING|DP=5|PP=1|TP=1|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:35 [WARNING|DP=6|PP=1|TP=1|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:35 [WARNING|DP=7|PP=1|TP=0|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:35 [WARNING|DP=5|PP=1|TP=0|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:35 [WARNING|DP=5|PP=0|TP=0|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 01:09:35 [WARNING|DP=6|PP=0|TP=0|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 01:09:35 [WARNING|DP=12|PP=0|TP=0|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:35 [WARNING|DP=5|PP=0|TP=1|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:36 [WARNING|DP=14|PP=0|TP=1|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:36 [WARNING|DP=12|PP=1|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 01:09:35 [WARNING|DP=15|PP=1|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:36 [WARNING|DP=1|PP=0|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:35 [WARNING|DP=3|PP=0|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:36 [WARNING|DP=8|PP=0|TP=1|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:35 [WARNING|DP=3|PP=1|TP=0|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 01:09:35 [WARNING|DP=11|PP=0|TP=1|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:36 [WARNING|DP=1|PP=1|TP=0|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 01:09:35 [WARNING|DP=3|PP=1|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 01:09:35 [WARNING|DP=10|PP=0|TP=0|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:35 [WARNING|DP=13|PP=1|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 01:09:36 [WARNING|DP=14|PP=1|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 01:09:35 [WARNING|DP=2|PP=0|TP=0|ip-26-0-160-225]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 01:09:36 [WARNING|DP=14|PP=0|TP=0|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:35 [WARNING|DP=4|PP=1|TP=1|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 01:09:35 [WARNING|DP=7|PP=1|TP=1|ip-26-0-171-102]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:36 [WARNING|DP=10|PP=0|TP=1|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 01:09:35 [WARNING|DP=10|PP=1|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:35 [WARNING|DP=9|PP=1|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:36 [WARNING|DP=15|PP=0|TP=0|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 01:09:36 [WARNING|DP=13|PP=0|TP=1|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 01:09:35 [WARNING|DP=11|PP=1|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:35 [WARNING|DP=8|PP=1|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:35 [WARNING|DP=11|PP=1|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 01:09:36 [WARNING|DP=15|PP=0|TP=1|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 01:09:35 [WARNING|DP=9|PP=1|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 01:09:36 [WARNING|DP=12|PP=1|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 01:09:36 [WARNING|DP=7|PP=0|TP=0|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 01:09:36 [WARNING|DP=6|PP=0|TP=1|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 01:09:36 [WARNING|DP=12|PP=0|TP=1|ip-26-0-161-153]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 01:09:36 [WARNING|DP=7|PP=0|TP=1|ip-26-0-161-103]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank10]: output = model(**micro_batch) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: sharded_logits = self.model( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank10]: output = self.pp_block(**new_kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank10]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank10]: output = self.o_proj(attention_output) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank10]: return row_linear( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default2]:[rank10]: out = differentiable_reduce_scatter_sum(out, group=group) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default2]:[rank10]: return DifferentiableReduceScatterSum.apply(tensor, group) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default2]:[rank10]: return super().apply(*args, **kwargs) # type: ignore[misc] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default2]:[rank10]: sharded_tensor = torch.empty( [default2]:[rank10]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 215.94 MiB is free. Including non-PyTorch memory, this process has 79.11 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank26]: Traceback (most recent call last): [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: trainer.train(dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank26]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank26]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank26]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank26]: output = model(**micro_batch) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank26]: sharded_logits = self.model( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank26]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank26]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank26]: output = self.pp_block(**new_kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default2]:[rank26]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default2]:[rank26]: merged_states = self.gate_up_proj(hidden_states) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank26]: return self._call_impl(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank26]: return forward_call(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default2]:[rank26]: return column_linear( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default2]:[rank26]: return F.linear(input, weight, bias) [default2]:[rank26]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 119.94 MiB is free. Including non-PyTorch memory, this process has 79.20 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank27]: Traceback (most recent call last): [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank27]: trainer.train(dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank27]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank27]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank27]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank27]: output = model(**micro_batch) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank27]: sharded_logits = self.model( [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank27]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank27]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank27]: output = self.pp_block(**new_kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default3]:[rank27]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default3]:[rank27]: merged_states = self.gate_up_proj(hidden_states) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank27]: return self._call_impl(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank27]: return forward_call(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default3]:[rank27]: return column_linear( [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default3]:[rank27]: return F.linear(input, weight, bias) [default3]:[rank27]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 271.94 MiB is free. Including non-PyTorch memory, this process has 79.05 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank29]: Traceback (most recent call last): [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank29]: trainer.train(dataloader) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank29]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank29]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank29]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank29]: output = model(**micro_batch) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank29]: sharded_logits = self.model( [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank29]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank29]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank29]: output = self.pp_block(**new_kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default5]:[rank29]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default5]:[rank29]: merged_states = self.gate_up_proj(hidden_states) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank29]: return self._call_impl(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank29]: return forward_call(*args, **kwargs) [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default5]:[rank29]: return column_linear( [default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default5]:[rank29]: return F.linear(input, weight, bias) [default5]:[rank29]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 271.94 MiB is free. Including non-PyTorch memory, this process has 79.05 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default4]:[rank28]: Traceback (most recent call last): [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank28]: trainer.train(dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank28]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank28]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank28]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank28]: output = model(**micro_batch) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank28]: sharded_logits = self.model( [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank28]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank28]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank28]: output = self.pp_block(**new_kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default4]:[rank28]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default4]:[rank28]: merged_states = self.gate_up_proj(hidden_states) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank28]: return self._call_impl(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank28]: return forward_call(*args, **kwargs) [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default4]:[rank28]: return column_linear( [default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default4]:[rank28]: return F.linear(input, weight, bias) [default4]:[rank28]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 119.94 MiB is free. Including non-PyTorch memory, this process has 79.20 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: output = model(**micro_batch) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: sharded_logits = self.model( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank14]: output = self.pp_block(**new_kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default6]:[rank14]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank14]: output = self.o_proj(attention_output) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank14]: return row_linear( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default6]:[rank14]: out = differentiable_reduce_scatter_sum(out, group=group) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default6]:[rank14]: return DifferentiableReduceScatterSum.apply(tensor, group) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default6]:[rank14]: return super().apply(*args, **kwargs) # type: ignore[misc] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default6]:[rank14]: sharded_tensor = torch.empty( [default6]:[rank14]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 215.94 MiB is free. Including non-PyTorch memory, this process has 79.11 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank13]: Traceback (most recent call last): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: output = model(**micro_batch) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank13]: sharded_logits = self.model( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank13]: output = self.pp_block(**new_kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default5]:[rank13]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default5]:[rank13]: output = self.o_proj(attention_output) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank13]: return row_linear( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default5]:[rank13]: out = differentiable_reduce_scatter_sum(out, group=group) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default5]:[rank13]: return DifferentiableReduceScatterSum.apply(tensor, group) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default5]:[rank13]: return super().apply(*args, **kwargs) # type: ignore[misc] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default5]:[rank13]: sharded_tensor = torch.empty( [default5]:[rank13]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 207.94 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default4]:[rank12]: Traceback (most recent call last): [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: trainer.train(dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank12]: output = model(**micro_batch) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: sharded_logits = self.model( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank12]: output = self.pp_block(**new_kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default4]:[rank12]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default4]:[rank12]: output = self.o_proj(attention_output) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank12]: return row_linear( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default4]:[rank12]: out = differentiable_reduce_scatter_sum(out, group=group) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default4]:[rank12]: return DifferentiableReduceScatterSum.apply(tensor, group) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default4]:[rank12]: return super().apply(*args, **kwargs) # type: ignore[misc] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default4]:[rank12]: sharded_tensor = torch.empty( [default4]:[rank12]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 135.94 MiB is free. Including non-PyTorch memory, this process has 79.19 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank8]: output = model(**micro_batch) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank8]: output = self.pp_block(**new_kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank8]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 565, in forward [default0]:[rank8]: key_value_states = key_value_states.permute(1, 2, 0, 3, 4).contiguous() [default0]:[rank8]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank9]: output = self.pp_block(**new_kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank9]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank9]: output = self.o_proj(attention_output) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank9]: return row_linear( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default1]:[rank9]: out = differentiable_reduce_scatter_sum(out, group=group) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default1]:[rank9]: return DifferentiableReduceScatterSum.apply(tensor, group) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default1]:[rank9]: return super().apply(*args, **kwargs) # type: ignore[misc] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default1]:[rank9]: sharded_tensor = torch.empty( [default1]:[rank9]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 207.94 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default4]:07/03/2024 01:09:41 [WARNING|DP=2|PP=1|TP=0|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default1]:[rank1]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default1]:[rank1]: merged_states = self.gate_up_proj(hidden_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default1]:[rank1]: return column_linear( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default1]:[rank1]: return F.linear(input, weight, bias) [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 839.94 MiB is free. Including non-PyTorch memory, this process has 78.50 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default0]:[rank0]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default0]:[rank0]: merged_states = self.gate_up_proj(hidden_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default0]:[rank0]: return column_linear( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default0]:[rank0]: return F.linear(input, weight, bias) [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU [default0]:[rank24]: Traceback (most recent call last): [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank24]: trainer.train(dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank24]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank24]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank24]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank24]: output = model(**micro_batch) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank24]: sharded_logits = self.model( [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank24]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank24]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank24]: output = self.pp_block(**new_kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank24]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank24]: return self._call_impl(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank24]: return forward_call(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 389, in forward [default0]:[rank24]: .contiguous() [default0]:[rank24]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank4]: output = self.pp_block(**new_kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default4]:[rank4]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default4]:[rank4]: merged_states = self.gate_up_proj(hidden_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default4]:[rank4]: return column_linear( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default4]:[rank4]: return F.linear(input, weight, bias) [default4]:[rank4]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 911.94 MiB is free. Including non-PyTorch memory, this process has 78.43 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank22]: Traceback (most recent call last): [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank22]: trainer.train(dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank22]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank22]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank22]: output = model(**micro_batch) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank22]: sharded_logits = self.model( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank22]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank22]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank22]: output = self.pp_block(**new_kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default6]:[rank22]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default6]:[rank22]: output = self.o_proj(attention_output) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank22]: return row_linear( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default6]:[rank22]: out = differentiable_reduce_scatter_sum(out, group=group) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default6]:[rank22]: return DifferentiableReduceScatterSum.apply(tensor, group) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default6]:[rank22]: return super().apply(*args, **kwargs) # type: ignore[misc] [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default6]:[rank22]: sharded_tensor = torch.empty( [default6]:[rank22]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 207.94 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank23]: Traceback (most recent call last): [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank23]: trainer.train(dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank23]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank23]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank23]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank23]: output = model(**micro_batch) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank23]: sharded_logits = self.model( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank23]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank23]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank23]: output = self.pp_block(**new_kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default7]:[rank23]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default7]:[rank23]: output = self.o_proj(attention_output) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default7]:[rank23]: return row_linear( [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default7]:[rank23]: out = differentiable_reduce_scatter_sum(out, group=group) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default7]:[rank23]: return DifferentiableReduceScatterSum.apply(tensor, group) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default7]:[rank23]: return super().apply(*args, **kwargs) # type: ignore[misc] [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default7]:[rank23]: sharded_tensor = torch.empty( [default7]:[rank23]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 135.94 MiB is free. Including non-PyTorch memory, this process has 79.19 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank5]: output = self.pp_block(**new_kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default5]:[rank5]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default5]:[rank5]: merged_states = self.gate_up_proj(hidden_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default5]:[rank5]: return column_linear( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default5]:[rank5]: return F.linear(input, weight, bias) [default5]:[rank5]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 839.94 MiB is free. Including non-PyTorch memory, this process has 78.50 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank7]: output = self.pp_block(**new_kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default7]:[rank7]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default7]:[rank7]: merged_states = self.gate_up_proj(hidden_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default7]:[rank7]: return column_linear( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default7]:[rank7]: return F.linear(input, weight, bias) [default7]:[rank7]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 839.94 MiB is free. Including non-PyTorch memory, this process has 78.50 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank6]: output = self.pp_block(**new_kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default6]:[rank6]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default6]:[rank6]: merged_states = self.gate_up_proj(hidden_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default6]:[rank6]: return column_linear( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default6]:[rank6]: return F.linear(input, weight, bias) [default6]:[rank6]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 911.94 MiB is free. Including non-PyTorch memory, this process has 78.43 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:[rank30]: Traceback (most recent call last): [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank30]: trainer.train(dataloader) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank30]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank30]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank30]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank30]: output = model(**micro_batch) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank30]: sharded_logits = self.model( [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank30]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank30]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank30]: output = self.pp_block(**new_kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default6]:[rank30]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default6]:[rank30]: merged_states = self.gate_up_proj(hidden_states) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank30]: return self._call_impl(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank30]: return forward_call(*args, **kwargs) [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default6]:[rank30]: return column_linear( [default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default6]:[rank30]: return F.linear(input, weight, bias) [default6]:[rank30]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 119.94 MiB is free. Including non-PyTorch memory, this process has 79.20 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank31]: Traceback (most recent call last): [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank31]: trainer.train(dataloader) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank31]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank31]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank31]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank31]: output = model(**micro_batch) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank31]: sharded_logits = self.model( [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank31]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank31]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank31]: output = self.pp_block(**new_kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default7]:[rank31]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default7]:[rank31]: merged_states = self.gate_up_proj(hidden_states) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank31]: return self._call_impl(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank31]: return forward_call(*args, **kwargs) [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default7]:[rank31]: return column_linear( [default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default7]:[rank31]: return F.linear(input, weight, bias) [default7]:[rank31]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 271.94 MiB is free. Including non-PyTorch memory, this process has 79.05 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank21]: Traceback (most recent call last): [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank21]: trainer.train(dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank21]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank21]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank21]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank21]: output = model(**micro_batch) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank21]: sharded_logits = self.model( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank21]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank21]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank21]: output = self.pp_block(**new_kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default5]:[rank21]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default5]:[rank21]: output = self.o_proj(attention_output) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank21]: return row_linear( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default5]:[rank21]: out = differentiable_reduce_scatter_sum(out, group=group) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default5]:[rank21]: return DifferentiableReduceScatterSum.apply(tensor, group) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default5]:[rank21]: return super().apply(*args, **kwargs) # type: ignore[misc] [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default5]:[rank21]: sharded_tensor = torch.empty( [default5]:[rank21]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 215.94 MiB is free. Including non-PyTorch memory, this process has 79.11 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank19]: Traceback (most recent call last): [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank19]: trainer.train(dataloader) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank19]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank19]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank19]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank19]: output = model(**micro_batch) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank19]: sharded_logits = self.model( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank19]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank19]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank19]: output = self.pp_block(**new_kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default3]:[rank19]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default3]:[rank19]: output = self.o_proj(attention_output) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank19]: return row_linear( [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default3]:[rank19]: out = differentiable_reduce_scatter_sum(out, group=group) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default3]:[rank19]: return DifferentiableReduceScatterSum.apply(tensor, group) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default3]:[rank19]: return super().apply(*args, **kwargs) # type: ignore[misc] [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default3]:[rank19]: sharded_tensor = torch.empty( [default3]:[rank19]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 135.94 MiB is free. Including non-PyTorch memory, this process has 79.19 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank18]: Traceback (most recent call last): [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank18]: trainer.train(dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank18]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank18]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank18]: output = model(**micro_batch) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank18]: sharded_logits = self.model( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank18]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank18]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank18]: output = self.pp_block(**new_kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default2]:[rank18]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default2]:[rank18]: output = self.o_proj(attention_output) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank18]: return row_linear( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default2]:[rank18]: out = differentiable_reduce_scatter_sum(out, group=group) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default2]:[rank18]: return DifferentiableReduceScatterSum.apply(tensor, group) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default2]:[rank18]: return super().apply(*args, **kwargs) # type: ignore[misc] [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default2]:[rank18]: sharded_tensor = torch.empty( [default2]:[rank18]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 207.94 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank16]: Traceback (most recent call last): [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank16]: trainer.train(dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank16]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank16]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank16]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank16]: output = model(**micro_batch) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank16]: sharded_logits = self.model( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank16]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank16]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank16]: output = self.pp_block(**new_kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default0]:[rank16]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 389, in forward [default0]:[rank16]: .contiguous() [default0]:[rank16]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU [default1]:[rank17]: Traceback (most recent call last): [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank17]: trainer.train(dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank17]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank17]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank17]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank17]: output = model(**micro_batch) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank17]: sharded_logits = self.model( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank17]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank17]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank17]: output = self.pp_block(**new_kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward [default1]:[rank17]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 598, in forward [default1]:[rank17]: output = self.o_proj(attention_output) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank17]: return row_linear( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default1]:[rank17]: out = differentiable_reduce_scatter_sum(out, group=group) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default1]:[rank17]: return DifferentiableReduceScatterSum.apply(tensor, group) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default1]:[rank17]: return super().apply(*args, **kwargs) # type: ignore[misc] [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default1]:[rank17]: sharded_tensor = torch.empty( [default1]:[rank17]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 215.94 MiB is free. Including non-PyTorch memory, this process has 79.11 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 171.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank2]: output = model(**micro_batch) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: sharded_logits = self.model( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default2]:[rank2]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default2]:[rank2]: merged_states = self.gate_up_proj(hidden_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default2]:[rank2]: return column_linear( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default2]:[rank2]: return F.linear(input, weight, bias) [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 911.94 MiB is free. Including non-PyTorch memory, this process has 78.43 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: output = model(**micro_batch) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank3]: output = self.pp_block(**new_kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default3]:[rank3]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default3]:[rank3]: merged_states = self.gate_up_proj(hidden_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default3]:[rank3]: return column_linear( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default3]:[rank3]: return F.linear(input, weight, bias) [default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 839.94 MiB is free. Including non-PyTorch memory, this process has 78.50 GiB memory in use. Of the allocated memory 68.92 GiB is allocated by PyTorch, and 683.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank50]: Traceback (most recent call last): [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank50]: trainer.train(dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank50]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank50]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank50]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank50]: output = model(**micro_batch) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank50]: sharded_logits = self.model( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank50]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank50]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: return forward_call(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank50]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank50]: pipeline_state.run_communication() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank50]: recv_activation_tensor = recv_activation() [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank50]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank50]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank50]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank50]: dist.recv( [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank50]: return func(*args, **kwargs) [default2]:[rank50]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank50]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank50]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank50]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank50]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f35af1a2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank50]: frame #1: + 0x5b3a23e (0x7f35e8cbf23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f35e8cb9c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f35e8cb9f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f35e8cbafd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35e8c6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35e8c6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35e8c6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f35e8c6f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f35b047c189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f35b0483610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f35b04a2978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #12: + 0x5adc309 (0x7f35e8c61309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #13: + 0x5ae6f10 (0x7f35e8c6bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: Traceback (most recent call last): [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank48]: trainer.train(dataloader) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank48]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank48]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank48]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank55]: Traceback (most recent call last): [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank50]: frame #14: + 0x5ae6fa5 (0x7f35e8c6bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #15: + 0x5124446 (0x7f35e82a9446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #16: + 0x1acf4b8 (0x7f35e4c544b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: output = model(**micro_batch) [default7]:[rank55]: trainer.train(dataloader) [default2]:[rank50]: frame #17: + 0x5aee004 (0x7f35e8c73004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank50]: frame #18: + 0x5af36b5 (0x7f35e8c786b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank48]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank50]: frame #19: + 0xd2631e (0x7f35fb86231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank48]: sharded_logits = self.model( [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank50]: frame #20: + 0x47def4 (0x7f35fafb9ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default2]:[rank50]: frame #21: + 0x1445a6 (0x5560c86785a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank55]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank50]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5560c8671a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return forward_call(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank50]: frame #23: + 0x150866 (0x5560c8684866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: output = model(**micro_batch) [default0]:[rank48]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank50]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5560c866d142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5560c8678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: frame #26: PyObject_Call + 0xbc (0x5560c8684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return self._call_impl(*args, **kwargs) [default7]:[rank55]: return forward_call(*args, **kwargs) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank50]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5560c866b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5560c8678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: return forward_call(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank50]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5560c86698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank48]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank50]: frame #30: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank55]: sharded_logits = self.model( [default2]:[rank50]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5560c86698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: pipeline_state.run_communication() [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: recv_activation_tensor = recv_activation() [default2]:[rank50]: frame #32: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank50]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5560c86698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: return forward_call(*args, **kwargs) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank48]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank50]: frame #34: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank48]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank55]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank50]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5560c86698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank48]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank48]: dist.recv( [default7]:[rank55]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank50]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5560c8670f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank48]: return func(*args, **kwargs) [default2]:[rank50]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5560c8682c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #38: + 0x211239 (0x5560c8745239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank55]: return self._call_impl(*args, **kwargs) [default0]:[rank48]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank48]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank50]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5560c8671a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: return forward_call(*args, **kwargs) [default2]:[rank50]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5560c866d3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank48]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank55]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank48]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcbcd846897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank50]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5560c8678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank48]: frame #1: + 0x5b3a23e (0x7fcc0736323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5560c8668c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: pipeline_state.run_communication() [default0]:[rank48]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fcc0735dc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5560c8678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank48]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fcc0735df82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5560c86698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #45: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fcc0735efd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #46: PyObject_Call + 0xbc (0x5560c8684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcc07313371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcc07313371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: recv_activation_tensor = recv_activation() [default2]:[rank50]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5560c866b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #48: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #49: PyObject_Call + 0xbc (0x5560c8684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank50]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5560c866b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcc07313371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcc07313371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fcbceb20189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5560c8678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank48]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fcbceb27610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fcbceb46978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank50]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5560c8671007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank50]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5560c8682c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank50]: frame #54: + 0x211239 (0x5560c8745239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #12: + 0x5adc309 (0x7fcc07305309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank50]: frame #55: PyObject_Call + 0x207 (0x5560c8685067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank50]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5560c866b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #57: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #13: + 0x5ae6f10 (0x7fcc0730ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank50]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5560c86698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #59: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: dist.recv( [default2]:[rank50]: frame #60: PyObject_Call + 0xbc (0x5560c8684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank50]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5560c866b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank48]: frame #14: + 0x5ae6fa5 (0x7fcc0730ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: return func(*args, **kwargs) [default0]:[rank48]: frame #15: + 0x5124446 (0x7fcc0694d446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank48]: frame #16: + 0x1acf4b8 (0x7fcc032f84b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #17: + 0x5aee004 (0x7fcc07317004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank50]: frame #62: + 0x150582 (0x5560c8684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank48]: frame #18: + 0x5af36b5 (0x7fcc0731c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #19: + 0xd2631e (0x7fcc19f0631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank50]: frame #63: PyObject_Call + 0xbc (0x5560c8684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #20: + 0x47def4 (0x7fcc1965def4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank55]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank50]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank55]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank55]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7115640897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank55]: frame #1: + 0x5b3a23e (0x7f714f15d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f714f157c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f714f157f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f714f158fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f714f10d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f714f10d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f714f10d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f714f10d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #21: + 0x1445a6 (0x55fc850605a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55fc85059a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f711691a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank55]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7116921610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank55]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f7116940978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank48]: frame #23: + 0x150866 (0x55fc8506c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #12: + 0x5adc309 (0x7f714f0ff309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #13: + 0x5ae6f10 (0x7f714f109f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55fc85055142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55fc85060a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #26: PyObject_Call + 0xbc (0x55fc8506cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55fc850532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #14: + 0x5ae6fa5 (0x7f714f109fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55fc85060a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #15: + 0x5124446 (0x7f714e747446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #16: + 0x1acf4b8 (0x7f714b0f24b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #17: + 0x5aee004 (0x7f714f111004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank48]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55fc850518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #18: + 0x5af36b5 (0x7f714f1166b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank55]: frame #19: + 0xd2631e (0x7f7161d0031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank55]: frame #20: + 0x47def4 (0x7f7161457ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank55]: frame #21: + 0x1445a6 (0x55ded07155a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55ded070ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #23: + 0x150866 (0x55ded0721866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #30: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55fc850518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55ded070a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55ded0715a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #26: PyObject_Call + 0xbc (0x55ded0721f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #32: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55ded07082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55ded0715a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55ded07068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #30: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55ded07068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #32: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55ded07068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #34: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55fc850518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55ded07068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #34: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55ded070df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55fc850518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55ded071fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55fc85058f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #38: + 0x211239 (0x55ded07e2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55ded070ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55fc8506ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55ded070a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55ded0715a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55ded0705c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55ded0715a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #38: + 0x211239 (0x55fc8512d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55fc85059a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55fc850553e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55fc85060a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55fc85050c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55ded07068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #45: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #46: PyObject_Call + 0xbc (0x55ded0721f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55ded07082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #48: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #49: PyObject_Call + 0xbc (0x55ded0721f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55fc85060a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55fc850518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #45: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: Traceback (most recent call last): [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank48]: frame #46: PyObject_Call + 0xbc (0x55fc8506cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55ded07082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: trainer.train(dataloader) [default7]:[rank55]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55ded0715a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank55]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55ded070e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55ded071fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #54: + 0x211239 (0x55ded07e2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #55: PyObject_Call + 0x207 (0x55ded0722067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55ded07082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55fc850532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank55]: frame #57: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55ded07068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #59: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #60: PyObject_Call + 0xbc (0x55ded0721f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55ded07082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank48]: frame #48: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #62: + 0x150582 (0x55ded0721582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #49: PyObject_Call + 0xbc (0x55fc8506cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: frame #63: PyObject_Call + 0xbc (0x55ded0721f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank51]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank48]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55fc850532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55fc85060a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55fc85059007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank48]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55fc8506ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #54: + 0x211239 (0x55fc8512d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #55: PyObject_Call + 0x207 (0x55fc8506d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank55]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:[rank48]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55fc850532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: output = model(**micro_batch) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default0]:[rank48]: frame #57: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank48]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55fc850518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #59: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #60: PyObject_Call + 0xbc (0x55fc8506cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55fc850532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #62: + 0x150582 (0x55fc8506c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: frame #63: PyObject_Call + 0xbc (0x55fc8506cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank48]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank51]: sharded_logits = self.model( [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank51]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank51]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank51]: return self._call_impl(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank51]: return forward_call(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank51]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank51]: pipeline_state.run_communication() [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank51]: recv_activation_tensor = recv_activation() [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank51]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank51]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank51]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank51]: dist.recv( [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank51]: return func(*args, **kwargs) [default3]:[rank51]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank51]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank51]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank51]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank51]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f13fcfd7897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank51]: frame #1: + 0x5b3a23e (0x7f1436af423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1436aeec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1436aeef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1436aeffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1436aa4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1436aa4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1436aa4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1436aa4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f13fe2b1189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f13fe2b8610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f13fe2d7978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank51]: frame #12: + 0x5adc309 (0x7f1436a96309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #13: + 0x5ae6f10 (0x7f1436aa0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #14: + 0x5ae6fa5 (0x7f1436aa0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #15: + 0x5124446 (0x7f14360de446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #16: + 0x1acf4b8 (0x7f1432a894b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #17: + 0x5aee004 (0x7f1436aa8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #18: + 0x5af36b5 (0x7f1436aad6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank51]: frame #19: + 0xd2631e (0x7f144969731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank51]: frame #20: + 0x47def4 (0x7f1448deeef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank51]: frame #21: + 0x1445a6 (0x555bbb7495a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555bbb742a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #23: + 0x150866 (0x555bbb755866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555bbb73e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555bbb749a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #26: PyObject_Call + 0xbc (0x555bbb755f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555bbb73c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555bbb749a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555bbb73a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #30: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555bbb73a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #32: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555bbb73a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #34: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555bbb73a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555bbb741f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555bbb753c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #38: + 0x211239 (0x555bbb816239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555bbb742a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555bbb73e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555bbb749a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555bbb739c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555bbb749a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555bbb73a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #45: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #46: PyObject_Call + 0xbc (0x555bbb755f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555bbb73c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #48: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #49: PyObject_Call + 0xbc (0x555bbb755f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555bbb73c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555bbb749a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555bbb742007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555bbb753c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #54: + 0x211239 (0x555bbb816239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #55: PyObject_Call + 0x207 (0x555bbb756067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555bbb73c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #57: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555bbb73a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #59: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #60: PyObject_Call + 0xbc (0x555bbb755f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555bbb73c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #62: + 0x150582 (0x555bbb755582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: frame #63: PyObject_Call + 0xbc (0x555bbb755f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank51]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank53]: Traceback (most recent call last): [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank53]: trainer.train(dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank53]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank49]: Traceback (most recent call last): [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank54]: Traceback (most recent call last): [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank49]: trainer.train(dataloader) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank53]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank54]: trainer.train(dataloader) [default5]:[rank53]: output = model(**micro_batch) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank54]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank49]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank54]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank49]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank53]: return forward_call(*args, **kwargs) [default1]:[rank49]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank54]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank53]: sharded_logits = self.model( [default1]:[rank49]: output = model(**micro_batch) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: output = model(**micro_batch) [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default1]:[rank49]: return forward_call(*args, **kwargs) [default5]:[rank53]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank54]: return forward_call(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank53]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank53]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: return forward_call(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank53]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank54]: sharded_logits = self.model( [default1]:[rank49]: sharded_logits = self.model( [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return forward_call(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank54]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: pipeline_state.run_communication() [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank49]: return forward_call(*args, **kwargs) [default5]:[rank53]: recv_activation_tensor = recv_activation() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank54]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank49]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank53]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank49]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank53]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank54]: return self._call_impl(*args, **kwargs) [default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank49]: return self._call_impl(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank54]: return forward_call(*args, **kwargs) [default5]:[rank53]: dist.recv( [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank49]: return forward_call(*args, **kwargs) [default6]:[rank54]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank53]: return func(*args, **kwargs) [default1]:[rank49]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank54]: pipeline_state.run_communication() [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank53]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: pipeline_state.run_communication() [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank53]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank54]: recv_activation_tensor = recv_activation() [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank53]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank54]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank49]: recv_activation_tensor = recv_activation() [default5]:[rank53]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank53]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe73403c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank49]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank53]: frame #1: + 0x5b3a23e (0x7fe76db5923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank54]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank49]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank53]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fe76db53c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank53]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe76db53f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: dist.recv( [default5]:[rank53]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe76db54fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank54]: dist.recv( [default5]:[rank53]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe76db09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: return func(*args, **kwargs) [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank49]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank54]: return func(*args, **kwargs) [default5]:[rank53]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe76db09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank54]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank49]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank54]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank53]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe76db09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank54]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank54]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank49]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2e4f4d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank53]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe76db09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f024f690897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank49]: frame #1: + 0x5b3a23e (0x7fd31ea6a23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe735316189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: frame #1: + 0x5b3a23e (0x7f02891ad23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fe73531d610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f02891a7c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fd31ea64c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fe73533c978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f02891a7f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd31ea64f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #12: + 0x5adc309 (0x7fe76dafb309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f02891a8fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd31ea65fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f028915d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #13: + 0x5ae6f10 (0x7fe76db05f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd31ea1a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd31ea1a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f028915d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f028915d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f028915d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f025096a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank49]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd31ea1a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd31ea1a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0250971610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0250990978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: frame #12: + 0x5adc309 (0x7f028914f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #13: + 0x5ae6f10 (0x7f0289159f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd2e6227189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank53]: frame #14: + 0x5ae6fa5 (0x7fe76db05fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #15: + 0x5124446 (0x7fe76d143446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #14: + 0x5ae6fa5 (0x7f0289159fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #16: + 0x1acf4b8 (0x7fe769aee4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fd2e622e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank53]: frame #17: + 0x5aee004 (0x7fe76db0d004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #15: + 0x5124446 (0x7f0288797446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #18: + 0x5af36b5 (0x7fe76db126b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fd2e624d978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank54]: frame #16: + 0x1acf4b8 (0x7f02851424b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #19: + 0xd2631e (0x7fe7806fc31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank54]: frame #17: + 0x5aee004 (0x7f0289161004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank49]: frame #12: + 0x5adc309 (0x7fd31ea0c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #20: + 0x47def4 (0x7fe77fe53ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank49]: frame #13: + 0x5ae6f10 (0x7fd31ea16f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #18: + 0x5af36b5 (0x7f02891666b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #21: + 0x1445a6 (0x560d144ff5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #19: + 0xd2631e (0x7f029bd5031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank53]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560d144f8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #23: + 0x150866 (0x560d1450b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560d144f4142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560d144ffa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #14: + 0x5ae6fa5 (0x7fd31ea16fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #20: + 0x47def4 (0x7f029b4a7ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank53]: frame #26: PyObject_Call + 0xbc (0x560d1450bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #21: + 0x1445a6 (0x5566862045a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #15: + 0x5124446 (0x7fd31e054446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560d144f22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #16: + 0x1acf4b8 (0x7fd31a9ff4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5566861fda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560d144ffa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #17: + 0x5aee004 (0x7fd31ea1e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank54]: frame #23: + 0x150866 (0x556686210866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560d144f08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5566861f9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #25: _PyFunction_Vectorcall + 0x6c (0x556686204a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #18: + 0x5af36b5 (0x7fd31ea236b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank53]: frame #30: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #19: + 0xd2631e (0x7fd33160d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank53]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560d144f08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #32: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #26: PyObject_Call + 0xbc (0x556686210f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560d144f08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #20: + 0x47def4 (0x7fd330d64ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank53]: frame #34: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #21: + 0x1445a6 (0x55bb2ee9a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55bb2ee93a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5566861f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560d144f08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #23: + 0x150866 (0x55bb2eea6866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560d144f7f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #28: _PyFunction_Vectorcall + 0x6c (0x556686204a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5566861f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bb2ee8f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55bb2ee9aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #26: PyObject_Call + 0xbc (0x55bb2eea6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #30: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb2ee8d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560d14509c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5566861f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55bb2ee9aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #38: + 0x211239 (0x560d145cc239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #32: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bb2ee8b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560d144f8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5566861f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #34: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560d144f43e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #30: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560d144ffa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560d144efc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bb2ee8b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #32: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560d144ffa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5566861f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560d144f08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bb2ee8b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #45: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #34: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5566861fcf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #46: PyObject_Call + 0xbc (0x560d1450bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55668620ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bb2ee8b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #38: + 0x211239 (0x5566862d1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bb2ee92f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5566861fda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55bb2eea4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560d144f22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5566861f93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #38: + 0x211239 (0x55bb2ef67239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #48: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55bb2ee93a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bb2ee8f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #49: PyObject_Call + 0xbc (0x560d1450bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55bb2ee9aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560d144f22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bb2ee8ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55bb2ee9aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #41: _PyFunction_Vectorcall + 0x6c (0x556686204a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560d144ffa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560d144f8007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5566861f4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560d14509c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bb2ee8b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #43: _PyFunction_Vectorcall + 0x6c (0x556686204a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #45: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #46: PyObject_Call + 0xbc (0x55bb2eea6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #54: + 0x211239 (0x560d145cc239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb2ee8d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #48: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5566861f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #55: PyObject_Call + 0x207 (0x560d1450c067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #45: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560d144f22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #46: PyObject_Call + 0xbc (0x556686210f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #49: PyObject_Call + 0xbc (0x55bb2eea6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5566861f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #57: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb2ee8d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #48: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560d144f08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55bb2ee9aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #49: PyObject_Call + 0xbc (0x556686210f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5566861f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bb2ee93007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #51: _PyFunction_Vectorcall + 0x6c (0x556686204a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #59: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55bb2eea4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5566861fd007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #60: PyObject_Call + 0xbc (0x560d1450bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55668620ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560d144f22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #54: + 0x211239 (0x55bb2ef67239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #54: + 0x211239 (0x5566862d1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #55: PyObject_Call + 0x207 (0x55bb2eea7067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #62: + 0x150582 (0x560d1450b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #55: PyObject_Call + 0x207 (0x556686211067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: frame #63: PyObject_Call + 0xbc (0x560d1450bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb2ee8d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5566861f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #57: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank53]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank49]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bb2ee8b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #57: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5566861f58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #59: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #60: PyObject_Call + 0xbc (0x55bb2eea6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb2ee8d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #59: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #60: PyObject_Call + 0xbc (0x556686210f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5566861f72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #62: + 0x150582 (0x556686210582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: frame #63: PyObject_Call + 0xbc (0x556686210f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank54]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank49]: frame #62: + 0x150582 (0x55bb2eea6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: frame #63: PyObject_Call + 0xbc (0x55bb2eea6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank49]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank43]: Traceback (most recent call last): [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank43]: trainer.train(dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank43]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank43]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank43]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank43]: output = model(**micro_batch) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: return forward_call(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank43]: sharded_logits = self.model( [default1]:[rank41]: Traceback (most recent call last): [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: trainer.train(dataloader) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: return forward_call(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank43]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank43]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: Traceback (most recent call last): [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank45]: trainer.train(dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank45]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank43]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank45]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank41]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank43]: return forward_call(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank45]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank41]: output = model(**micro_batch) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default5]:[rank45]: output = model(**micro_batch) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: pipeline_state.run_communication() [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: return forward_call(*args, **kwargs) [default3]:[rank43]: recv_activation_tensor = recv_activation() [default5]:[rank45]: return forward_call(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank41]: sharded_logits = self.model( [default3]:[rank43]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank43]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank45]: sharded_logits = self.model( [default3]:[rank43]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank45]: return forward_call(*args, **kwargs) [default1]:[rank41]: return forward_call(*args, **kwargs) [default3]:[rank43]: dist.recv( [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank45]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank43]: return func(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank45]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank41]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank41]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank45]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank43]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank41]: return forward_call(*args, **kwargs) [default2]:[rank42]: Traceback (most recent call last): [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank43]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank42]: trainer.train(dataloader) [default1]:[rank41]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank43]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6e19385897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank45]: return forward_call(*args, **kwargs) [default1]:[rank41]: pipeline_state.run_communication() [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: recv_activation_tensor = recv_activation() [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank42]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank43]: frame #1: + 0x5b3a23e (0x7f6e52ea223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank41]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank45]: pipeline_state.run_communication() [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank43]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f6e52e9cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6e52e9cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank43]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6e52e9dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank45]: recv_activation_tensor = recv_activation() [default1]:[rank41]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank43]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6e52e52371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank45]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank43]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6e52e52371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank41]: dist.recv( [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank43]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6e52e52371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank45]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank42]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank41]: return func(*args, **kwargs) [default3]:[rank43]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6e52e52371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank43]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6e1a65f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank41]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank43]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f6e1a666610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: dist.recv( [default3]:[rank43]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f6e1a685978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #12: + 0x5adc309 (0x7f6e52e44309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank41]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank43]: frame #13: + 0x5ae6f10 (0x7f6e52e4ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank41]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2bd093897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank41]: frame #1: + 0x5b3a23e (0x7fa2f6bb023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa2f6baac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: return func(*args, **kwargs) [default5]:[rank45]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank43]: frame #14: + 0x5ae6fa5 (0x7f6e52e4efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa2f6baaf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank42]: output = model(**micro_batch) [default3]:[rank43]: frame #15: + 0x5124446 (0x7f6e5248c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #16: + 0x1acf4b8 (0x7f6e4ee374b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank45]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank41]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa2f6babfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa2f6b60371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #17: + 0x5aee004 (0x7f6e52e56004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6aa0d5b897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank45]: frame #1: + 0x5b3a23e (0x7f6ada87823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: Traceback (most recent call last): [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank33]: trainer.train(dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank33]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank33]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank33]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default2]:[rank42]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: frame #18: + 0x5af36b5 (0x7f6e52e5b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank33]: output = model(**micro_batch) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank43]: frame #19: + 0xd2631e (0x7f6e65a4531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank45]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f6ada872c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: sharded_logits = self.model( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank33]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank33]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank33][default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl : File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank33]: return self._call_impl(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank33]: return forward_call(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank33]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank33]: pipeline_state.run_communication() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel[default3]:[rank43]: frame #20: + 0x47def4 (0x7f6e6519cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank41]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa2f6b60371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6ada872f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: Traceback (most recent call last): [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank57]: trainer.train(dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank57]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank57]: outputs = self.pipeline_engine.train_batch_iter( /pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank33]: recv_activation_tensor = recv_activation() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank33]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank43]: frame #21: + 0x1445a6 (0x55eb8583e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank33]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank33]: dist.recv( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank33]: return func(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank33]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank33[default5]:[rank45]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6ada873fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ada828371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank57]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank57]: output = model(**micro_batch) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_clust]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank33]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank33]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6ae2aeb897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank33]: frame #1: + 0x5b3a23e (0x7f6b1c60823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f6b1c602c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33[default2]:[rank42]: return forward_call(*args, **kwargs) er/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank57]: sharded_logits = self.model( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank57]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states ]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6b1c602f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6b1c603fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6b1c5b8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6b1c5b8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6b1c5b8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [defau[default1]:[rank41]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa2f6b60371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: return forward_call(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank57]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank57]: pipeline_state.run_communicationlt1]:[rank33]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6b1c5b8371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6ae3dc5189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f6ae3dcc610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f6ae3deb978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #12: + 0x5adc309 (0x7f6b1c5aa309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #13: + 0x5ae6f10 (0x7f6b1c5b4f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #14: + 0x5ae6fa5 (0x7f6b1c5b4fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #15: + 0x5124446 (0x7f6b1bbf2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #16: + 0x1acf4b8 (0x7f6b1859d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #17: + 0x5aee004 (0x7f6b1c5bc004 in /fsx/ferdinandmom/miniforge3/en[default3]:[rank43]: frame #23: + 0x150866 (0x55eb8584a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank57]: dist.recv( [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank57]: return func(*args, **kwargs) [default1]:[rank57]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank57]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank57]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default1]:[rank57]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank57]: frame #0: c10::Errorvs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #18: + 0x5af36b5 (0x7f6b1c5c16b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #19: + 0xd2631e (0x7f6b2f1ab31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank33]: frame #20: + 0x47def4 (0x7f6b2e902ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank33]: frame #21: + 0x1445a6 (0x556d87ac95a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #22: _PyObject_MakeTpCall + 0x26b (0x556d87ac2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #23: + 0x150866 (0x556d87ad586[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank42]: sharded_logits = self.model( ::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc19e8bd897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:[rank57]: frame #1: + 0x5b3a23e (0x7fc1d83da23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fc1d83d4c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc1d83d4f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc1d83d5fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x556d87abe142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #25: _PyFunction_Vectorcall + 0x6c (0x556d87ac9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #26: PyObject_Call + 0xbc (0x556d87ad5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x556d87abc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #28: _PyFunction_Vectorcall + 0x6c (0x556d87ac9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x556d87aba8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #30: + 0[default5]:[rank45]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ada828371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) libtorch_cpu.so) [default1]:[rank57]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc1d838a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x556d87aba8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #32: + 0x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x556d87aba8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ada828371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc1d838a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #34: + 0x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x556d87aba8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x556d87ac1f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #37: _PyObject_Call_Prepend + 0x69 (0x556d87ad3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #38: + 0x211239 (0x556d87b96239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #39: _PyObject_MakeTpCall + 0x26b (0x556d87ac2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ada828371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa2f6b60371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc1d838a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x556d87abe3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #41: _PyFunction_Vectorcall + 0x6c (0x556d87ac9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x556d87ab9c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #43: _PyFunction_Vectorcall + 0x6c (0x556d87ac9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x556d87aba8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #45: + 0x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55eb85833142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc1d838a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc19fb97189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #46: PyObject_Call + 0xbc (0x556d87ad5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x556d87abc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #48: + 0x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fc19fb9e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank57]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fc19fbbd978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank33]: frame #49: PyObject_Call + 0xbc (0x556d87ad5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x556d87abc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #51: _PyFunction_Vectorcall + 0x6c (0x556d87ac9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6aa2035189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f6aa203c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55eb8583ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #12: + 0x5adc309 (0x7fc1d837c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #13: + 0x5ae6f10 (0x7fc1d8386f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x556d87ac2007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #53: _PyObject_Call_Prepend + 0x69 (0x556d87ad3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #54: + 0x211239 (0x556d87b96239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #55: PyObject_Call + 0x207 (0x556d87ad6067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x556d87abc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa2be36d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank57]: frame #14: + 0x5ae6fa5 (0x7fc1d8386fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #15: + 0x5124446 (0x7fc1d79c4446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #57: + 0x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x556d87aba8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: return forward_call(*args, **kwargs) [default1]:[rank57]: frame #16: + 0x1acf4b8 (0x7fc1d436f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #17: + 0x5aee004 (0x7fc1d838e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank33]: frame #59: + 0x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #60: PyObject_Call + 0xbc (0x556d87ad5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x556d87abc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f6aa205b978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #26: PyObject_Call + 0xbc (0x55eb8584af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #18: + 0x5af36b5 (0x7fc1d83936b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #19: + 0xd2631e (0x7fc1eaf7d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #20: + 0x47def4 (0x7fc1ea6d4ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank33]: frame #62: + 0x150582 (0x556d87ad5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: frame #63: PyObject_Call + 0xbc (0x556d87ad5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank57]: frame #21: + 0x1445a6 (0x55b729eba5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b729eb3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank33]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank41]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa2be374610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank41]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa2be393978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:[rank57]: frame #23: + 0x150866 (0x55b729ec6866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b729eaf142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #12: + 0x5adc309 (0x7f6ada81a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb858312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #13: + 0x5ae6f10 (0x7f6ada824f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b729ebaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank57]: frame #26: PyObject_Call + 0xbc (0x55b729ec6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank56]: Traceback (most recent call last): [default3]:[rank59]: Traceback (most recent call last): [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank59]: trainer.train(dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank42]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank57]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b729ead2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55eb8583ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b729ebaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b729eab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55eb8582f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #30: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank56]: trainer.train(dataloader) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank57]: frame #30: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b729eab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #14: + 0x5ae6fa5 (0x7f6ada824fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #15: + 0x5124446 (0x7f6ad9e62446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank57]: frame #32: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank45]: frame #16: + 0x1acf4b8 (0x7f6ad680d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank57]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b729eab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #12: + 0x5adc309 (0x7fa2f6b52309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank41]: frame #13: + 0x5ae6f10 (0x7fa2f6b5cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: frame #17: + 0x5aee004 (0x7f6ada82c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #18: + 0x5af36b5 (0x7f6ada8316b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #34: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: Traceback (most recent call last): [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: trainer.train(dataloader) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: return self._call_impl(*args, **kwargs) [default1]:[rank57]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b729eab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank59]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank57]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b729eb2f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank45]: frame #19: + 0xd2631e (0x7f6aed41b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank45]: frame #20: + 0x47def4 (0x7f6aecb72ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank57]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b729ec4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #38: + 0x211239 (0x55b729f87239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b729eb3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #14: + 0x5ae6fa5 (0x7fa2f6b5cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b729eaf3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: return forward_call(*args, **kwargs) [default1]:[rank41]: frame #15: + 0x5124446 (0x7fa2f619a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #16: + 0x1acf4b8 (0x7fa2f2b454b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b729ebaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: Traceback (most recent call last): [default5]:[rank45]: frame #21: + 0x1445a6 (0x5615633cc5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55eb8582f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank37]: trainer.train(dataloader) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank37]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank56]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank57]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b729eaac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank59]: output = model(**micro_batch) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank34]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank43]: frame #32: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b729ebaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default1]:[rank41]: frame #17: + 0x5aee004 (0x7fa2f6b64004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank41]: frame #18: + 0x5af36b5 (0x7fa2f6b696b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank57]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b729eab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:[rank57]: frame #45: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: output = model(**micro_batch) [default5]:[rank45]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5615633c5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #19: + 0xd2631e (0x7fa30975331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: output = model(**micro_batch) [default5]:[rank45]: frame #23: + 0x150866 (0x5615633d8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5615633c1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #46: PyObject_Call + 0xbc (0x55b729ec6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: frame #20: + 0x47def4 (0x7fa308eaaef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank42]: pipeline_state.run_communication() [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank45]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5615633cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #26: PyObject_Call + 0xbc (0x5615633d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: output = model(**micro_batch) [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank43]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55eb8582f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #34: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b729ead2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #48: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: return forward_call(*args, **kwargs) [default2]:[rank42]: recv_activation_tensor = recv_activation() [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank43]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55eb8582f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: sharded_logits = self.model( [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: return forward_call(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank43]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55eb85836f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: sharded_logits = self.model( [default1]:[rank41]: frame #21: + 0x1445a6 (0x55cfc520e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank58]: Traceback (most recent call last): [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: return forward_call(*args, **kwargs) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank45]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5615633bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: return forward_call(*args, **kwargs) [default1]:[rank57]: frame #49: PyObject_Call + 0xbc (0x55b729ec6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b729ead2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b729ebaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank45]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5615633cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55eb85848c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank57]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b729eb3007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default3]:[rank43]: frame #38: + 0x211239 (0x55eb8590b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return forward_call(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank43]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55eb85837a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: sharded_logits = self.model( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: return forward_call(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank45]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5615633bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank37]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55cfc5207a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #23: + 0x150866 (0x55cfc521a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55eb858333e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: sharded_logits = self.model( [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank45]: frame #30: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55eb8583ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank57]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b729ec4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank45]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5615633bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55cfc5203142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55cfc520ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #54: + 0x211239 (0x55b729f87239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank41]: frame #26: PyObject_Call + 0xbc (0x55cfc521af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: trainer.train(dataloader) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: return forward_call(*args, **kwargs) [default3]:[rank43]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55eb8582ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #55: PyObject_Call + 0x207 (0x55b729ec7067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: return forward_call(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank41]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55cfc52012b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b729ead2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank41]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55cfc520ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank43]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55eb8583ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #57: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank45]: frame #32: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b729eab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank43]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55eb8582f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank42]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: pipeline_state.run_communication() [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank37]: pipeline_state.run_communication() [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank56]: return forward_call(*args, **kwargs) [default2]:[rank34]: recv_activation_tensor = recv_activation() [default3]:[rank43]: frame #45: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank57]: frame #59: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank37]: recv_activation_tensor = recv_activation() [default1]:[rank41]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55cfc51ff8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #60: PyObject_Call + 0xbc (0x55b729ec6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: Traceback (most recent call last): [default5]:[rank61]: Traceback (most recent call last): [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank41]: frame #30: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #46: PyObject_Call + 0xbc (0x55eb8584af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank37]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5615633bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #34: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b729ead2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank34]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank41]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55cfc51ff8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank57]: frame #62: + 0x150582 (0x55b729ec6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank45]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5615633bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: return forward_call(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank45]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5615633c4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: trainer.train(dataloader) [default1]:[rank57]: frame #63: PyObject_Call + 0xbc (0x55b729ec6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank42]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank34]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank43]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb858312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #48: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank41]: frame #32: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55cfc51ff8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #34: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5615633d6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank45]: frame #38: + 0x211239 (0x561563499239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #49: PyObject_Call + 0xbc (0x55eb8584af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank57]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank42]: dist.recv( [default0]:[rank56]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank45]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5615633c5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank37]: dist.recv( [default5]:[rank45]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5615633c13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55cfc51ff8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55cfc5206f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:[rank41]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55cfc5218c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank34]: dist.recv( [default3]:[rank43]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb858312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: return func(*args, **kwargs) [default4]:[rank60]: trainer.train(dataloader) [default3]:[rank59]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank56]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank37]: return func(*args, **kwargs) [default1]:[rank41]: frame #38: + 0x211239 (0x55cfc52db239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55cfc5207a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank37]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank37]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank41]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55cfc52033e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank34]: return func(*args, **kwargs) [default5]:[rank45]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5615633cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank37]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank45]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5615633bcc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55cfc520ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank41]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55cfc51fec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank37]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fadb08a4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank45]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5615633cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank37]: frame #1: + 0x5b3a23e (0x7fadea3c123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5615633bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank34]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank43]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55eb8583ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank37]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fadea3bbc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fadea3bbf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #45: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank34]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank34]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f254439e897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank45]: frame #46: PyObject_Call + 0xbc (0x5615633d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55eb85837007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank60]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank37]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fadea3bcfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadea371371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55cfc520ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank58]: output = model(**micro_batch) [default2]:[rank34]: frame #1: + 0x5b3a23e (0x7f257debb23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f257deb5c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadea371371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55cfc51ff8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5615633bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55eb85848c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadea371371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #48: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: output = model(**micro_batch) [default5]:[rank37]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fadea371371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #54: + 0x211239 (0x55eb8590b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank34]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f257deb5f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #45: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank59]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f257deb6fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f257de6b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f257de6b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #49: PyObject_Call + 0xbc (0x5615633d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f257de6b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5615633bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f257de6b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #46: PyObject_Call + 0xbc (0x55cfc521af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank34]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2545678189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5615633cca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5615633c5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f254567f610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank37]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fadb1b7e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank45]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5615633d6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f254569e978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #55: PyObject_Call + 0x207 (0x55eb8584b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #12: + 0x5adc309 (0x7f257de5d309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55cfc52012b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #13: + 0x5ae6f10 (0x7f257de67f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb858312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fadb1b85610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank37]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fadb1ba4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank43]: frame #57: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default1]:[rank41]: frame #48: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #49: PyObject_Call + 0xbc (0x55cfc521af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: return forward_call(*args, **kwargs) [default5]:[rank37]: frame #12: + 0x5adc309 (0x7fadea363309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #13: + 0x5ae6f10 (0x7fadea36df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55cfc52012b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank34]: frame #14: + 0x5ae6fa5 (0x7f257de67fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #15: + 0x5124446 (0x7f257d4a5446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #54: + 0x211239 (0x561563499239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #55: PyObject_Call + 0x207 (0x5615633d9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: output = model(**micro_batch) [default5]:[rank37]: frame #14: + 0x5ae6fa5 (0x7fadea36dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55eb8582f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank34]: frame #16: + 0x1acf4b8 (0x7f2579e504b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efffe4ee897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank61]: return forward_call(*args, **kwargs) [default5]:[rank37]: frame #15: + 0x5124446 (0x7fade99ab446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #16: + 0x1acf4b8 (0x7fade63564b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5615633bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #57: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank37]: frame #17: + 0x5aee004 (0x7fadea375004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5615633bd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: return forward_call(*args, **kwargs) [default2]:[rank34]: frame #17: + 0x5aee004 (0x7f257de6f004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #18: + 0x5af36b5 (0x7f257de746b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank43]: frame #59: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank37]: frame #18: + 0x5af36b5 (0x7fadea37a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #59: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank41]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55cfc520ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank41]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55cfc5207007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #1: + 0x5b3a23e (0x7f003800b23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: sharded_logits = self.model( [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank56]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank43]: frame #60: PyObject_Call + 0xbc (0x55eb8584af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb858312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank43]: frame #62: + 0x150582 (0x55eb8584a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default1]:[rank41]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55cfc5218c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #19: + 0xd2631e (0x7f2590a5e31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank41]: frame #54: + 0x211239 (0x55cfc52db239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank43]: frame #63: PyObject_Call + 0xbc (0x55eb8584af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: pipeline_state.run_communication() [default2]:[rank34]: frame #20: + 0x47def4 (0x7f25901b5ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank34]: frame #21: + 0x1445a6 (0x55a30f0bd5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #60: PyObject_Call + 0xbc (0x5615633d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5615633bf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank37]: frame #19: + 0xd2631e (0x7fadfcf6431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:[rank41]: frame #55: PyObject_Call + 0x207 (0x55cfc521b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0038005c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: frame #20: + 0x47def4 (0x7fadfc6bbef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank39]: Traceback (most recent call last): [default3]:[rank43]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a30f0b6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0038005f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank37]: frame #21: + 0x1445a6 (0x555eeeb655a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55cfc52012b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: return forward_call(*args, **kwargs) [default4]:[rank60]: return forward_call(*args, **kwargs) [default5]:[rank37]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555eeeb5ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank45]: frame #62: + 0x150582 (0x5615633d8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0038006fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:[rank41]: frame #57: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank37]: frame #23: + 0x150866 (0x555eeeb71866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0037fbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: sharded_logits = self.model( [default3]:[rank59]: recv_activation_tensor = recv_activation() [default5]:[rank37]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555eeeb5a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55cfc51ff8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #59: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: pipeline_state.run_communication() [default4]:[rank36]: Traceback (most recent call last): [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank35]: Traceback (most recent call last): [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank35]: trainer.train(dataloader) [default2]:[rank42]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0037fbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: return forward_call(*args, **kwargs) [default5]:[rank37]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555eeeb65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: trainer.train(dataloader) [default2]:[rank42]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0037fbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank45]: frame #63: PyObject_Call + 0xbc (0x5615633d8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #60: PyObject_Call + 0xbc (0x55cfc521af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank37]: frame #26: PyObject_Call + 0xbc (0x555eeeb71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0037fbb371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank36]: trainer.train(dataloader) [default5]:[rank45]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default1]:[rank41]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55cfc52012b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: frame #62: + 0x150582 (0x55cfc521a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank39]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank41]: frame #63: PyObject_Call + 0xbc (0x55cfc521af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7effff7c8189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank34]: frame #23: + 0x150866 (0x55a30f0c9866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:[rank41]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank39]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7effff7cf610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank42]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7effff7ee978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank59]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank35]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: frame #12: + 0x5adc309 (0x7f0037fad309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #13: + 0x5ae6f10 (0x7f0037fb7f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank42]: frame #14: + 0x5ae6fa5 (0x7f0037fb7fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #15: + 0x5124446 (0x7f00375f5446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #16: + 0x1acf4b8 (0x7f0033fa04b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank37]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555eeeb582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank42]: frame #17: + 0x5aee004 (0x7f0037fbf004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #18: + 0x5af36b5 (0x7f0037fc46b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank42]: frame #19: + 0xd2631e (0x7f004abae31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank34]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a30f0b2142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #20: + 0x47def4 (0x7f004a305ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank42]: frame #21: + 0x1445a6 (0x55adf88275a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55adf8820a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #23: + 0x150866 (0x55adf8833866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: recv_activation_tensor = recv_activation() [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank42]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55adf881c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55adf8827a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #26: PyObject_Call + 0xbc (0x55adf8833f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank36]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55adf881a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55adf8827a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55adf88188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank34]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a30f0bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #30: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55adf88188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #32: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55adf88188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank39]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank42]: frame #34: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55adf88188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55adf881ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55adf8831c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555eeeb65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555eeeb568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #38: + 0x211239 (0x55adf88f4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55adf8820a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55adf881c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55adf8827a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #26: PyObject_Call + 0xbc (0x55a30f0c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55adf8817c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55adf8827a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55adf88188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank42]: frame #45: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #46: PyObject_Call + 0xbc (0x55adf8833f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55adf881a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #48: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: sharded_logits = self.model( [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default2]:[rank34]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a30f0b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a30f0bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #49: PyObject_Call + 0xbc (0x55adf8833f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55adf881a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55adf8827a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55adf8820007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank42]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55adf8831c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #54: + 0x211239 (0x55adf88f4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #55: PyObject_Call + 0x207 (0x55adf8834067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank56]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank42]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55adf881a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #57: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55adf88188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #59: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank42]: frame #60: PyObject_Call + 0xbc (0x55adf8833f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55adf881a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: frame #62: + 0x150582 (0x55adf8833582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank42]: frame #63: PyObject_Call + 0xbc (0x55adf8833f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank42]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank60]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank40]: Traceback (most recent call last): [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank61]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a30f0ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank35]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank36]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank44]: Traceback (most recent call last): [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank44]: trainer.train(dataloader) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank44]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank37]: frame #30: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank40]: trainer.train(dataloader) [default4]:[rank60]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank56]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank37]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555eeeb568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default5]:[rank61]: return forward_call(*args, **kwargs) [default3]:[rank35]: output = model(**micro_batch) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: output = model(**micro_batch) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default0]:[rank40]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank35]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #30: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank34]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a30f0ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #32: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default0]:[rank40]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank56]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank36]: output = model(**micro_batch) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: return self._call_impl(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: output = model(**micro_batch) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: output = model(**micro_batch) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank35]: return forward_call(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank37]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555eeeb568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: return forward_call(*args, **kwargs) [default5]:[rank61]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank34]: frame #32: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: return forward_call(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank34]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a30f0ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: sharded_logits = self.model( [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank61]: pipeline_state.run_communication() [default4]:[rank60]: return forward_call(*args, **kwargs) [default5]:[rank37]: frame #34: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #34: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: sharded_logits = self.model( [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank44]: return forward_call(*args, **kwargs) [default0]:[rank56]: dist.recv( [default2]:[rank34]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a30f0ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a30f0b5f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: sharded_logits = self.model( [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank58]: return forward_call(*args, **kwargs) [default7]:[rank39]: return forward_call(*args, **kwargs) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank37]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555eeeb568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank44]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: dist.recv( [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank34]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a30f0c7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #38: + 0x211239 (0x55a30f18a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank56]: return func(*args, **kwargs) [default4]:[rank36]: return forward_call(*args, **kwargs) [default0]:[rank40]: return forward_call(*args, **kwargs) [default5]:[rank61]: recv_activation_tensor = recv_activation() [default2]:[rank58]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank35]: return self._call_impl(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank35]: return forward_call(*args, **kwargs) [default4]:[rank44]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank56]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank34]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a30f0b6a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank59]: return func(*args, **kwargs) [default5]:[rank37]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555eeeb5df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank36]: sharded_logits = self.model( [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank44]: return self._call_impl(*args, **kwargs) [default0]:[rank56]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank44]: return forward_call(*args, **kwargs) [default3]:[rank59]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank60]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default2]:[rank34]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a30f0b23e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank40]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank56]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank37]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555eeeb6fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #38: + 0x211239 (0x555eeec32239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank58]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank56]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank35]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank56]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcd9b1ab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank34]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a30f0bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return forward_call(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank44]: pipeline_state.run_communication() [default2]:[rank58]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank61]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank44]: recv_activation_tensor = recv_activation() [default0]:[rank56]: frame #1: + 0x5b3a23e (0x7fcdd4cc823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: return forward_call(*args, **kwargs) [default4]:[rank60]: pipeline_state.run_communication() [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank44]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank56]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fcdd4cc2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank44]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank34]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a30f0adc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: sharded_logits = self.model( [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank58]: return forward_call(*args, **kwargs) [default0]:[rank56]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fcdd4cc2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555eeeb5ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555eeeb5a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank34]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a30f0bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: dist.recv( [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank35]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default4]:[rank36]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank40]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank58]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank59]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3152507897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank37]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555eeeb65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank34]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a30f0ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank40]: pipeline_state.run_communication() [default4]:[rank44]: return func(*args, **kwargs) [default4]:[rank44]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank58]: pipeline_state.run_communication() [default5]:[rank37]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555eeeb55c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555eeeb65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: recv_activation_tensor = recv_activation() [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank60]: recv_activation_tensor = recv_activation() [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank36]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank44]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank56]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fcdd4cc3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank40]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank60]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank37]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555eeeb568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #45: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #46: PyObject_Call + 0xbc (0x55a30f0c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: return self._call_impl(*args, **kwargs) [default5]:[rank37]: frame #45: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank58]: recv_activation_tensor = recv_activation() [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank35]: return forward_call(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank44]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank56]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcdd4c78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank36]: return forward_call(*args, **kwargs) [default4]:[rank44]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f871ace0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank34]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a30f0b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #48: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: return forward_call(*args, **kwargs) [default0]:[rank40]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank44]: frame #1: + 0x5b3a23e (0x7f87547fd23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: frame #1: + 0x5b3a23e (0x7f318c02423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #46: PyObject_Call + 0xbc (0x555eeeb71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555eeeb582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank56]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcdd4c78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #48: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank35]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank44]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f87547f7c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank39]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank40]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f318c01ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #49: PyObject_Call + 0xbc (0x55a30f0c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #49: PyObject_Call + 0xbc (0x555eeeb71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f87547f7f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank35]: pipeline_state.run_communication() [default4]:[rank36]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank44]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f87547f8fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f87547ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank37]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555eeeb582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f87547ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f318c01ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: dist.recv( [default4]:[rank44]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f87547ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank56]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcdd4c78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank60]: dist.recv( [default2]:[rank34]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a30f0b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank34]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a30f0bda2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: return func(*args, **kwargs) [default0]:[rank56]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcdd4c78371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank36]: pipeline_state.run_communication() [default3]:[rank35]: recv_activation_tensor = recv_activation() [default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: return self._call_impl(*args, **kwargs) [default4]:[rank44]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f87547ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank34]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a30f0b6007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555eeeb65a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank40]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:[rank58]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank56]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fcd9c485189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: dist.recv( [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank35]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank44]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f871bfba189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank58]: dist.recv( [default5]:[rank37]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555eeeb5e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555eeeb6fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank61]: return func(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank39]: return forward_call(*args, **kwargs) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank44]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f871bfc1610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fcd9c48c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f77f66ab897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank60]: return func(*args, **kwargs) [default5]:[rank61]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank44]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f871bfe0978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank40]: frame #1: + 0x5b3a23e (0x7f78301c823e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f78301c2c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f78301c2f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f78301c3fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7830178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/l[default4]:[rank60]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank61]: pg.recv([tensor], group_src_rank, tag).wait() ib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7830178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f318c01ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #12: + 0x5adc309 (0x7f875479f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank61]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank37]: frame #54: + 0x211239 (0x555eeec32239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #13: + 0x5ae6f10 (0x7f87547a9f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7830178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f318bfd4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a30f0c7c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7830178371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f77f7985189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:[rank61]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank39]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank40]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f77f798c610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f77f79ab978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank59]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f318bfd4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank40]: frame #12: + 0x5adc309 (0x7f783016a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default5]:[rank61]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0d836b3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:[rank37]: frame #55: PyObject_Call + 0x207 (0x555eeeb72067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #14: + 0x5ae6fa5 (0x7f87547a9fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f318bfd4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f318bfd4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank36]: recv_activation_tensor = recv_activation() [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank40]: frame #13: + 0x5ae6f10 (0x7f7830174f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: return func(*args, **kwargs) [default5]:[rank37]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555eeeb582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank37]: frame #57: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #15: + 0x5124446 (0x7f8753de7446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f023a08d897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:[rank59]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f31537e1189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fcd9c4ab978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank56]: frame #12: + 0x5adc309 (0x7fcdd4c6a309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #54: + 0x211239 (0x55a30f18a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: pipeline_state.run_communication() [default4]:[rank44]: frame #16: + 0x1acf4b8 (0x7f87507924b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank37]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555eeeb568fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #14: + 0x5ae6fa5 (0x7f7830174fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #55: PyObject_Call + 0x207 (0x55a30f0ca067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #17: + 0x5aee004 (0x7f87547b1004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank39]: recv_activation_tensor = recv_activation() [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank40]: frame #15: + 0x5124446 (0x7f782f7b2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: frame #18: + 0x5af36b5 (0x7f87547b66b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #59: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #19: + 0xd2631e (0x7f87673a031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank40]: frame #16: + 0x1acf4b8 (0x7f782c15d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a30f0b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #17: + 0x5aee004 (0x7f783017c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #18: + 0x5af36b5 (0x7f78301816b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #1: + 0x5b3a23e (0x7f0273baa23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0273ba4c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank44]: frame #20: + 0x47def4 (0x7f8766af7ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank60]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0273ba4f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank37]: frame #60: PyObject_Call + 0xbc (0x555eeeb71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #21: + 0x1445a6 (0x559db98d35a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank59]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f31537e8610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: frame #1: + 0x5b3a23e (0x7f0dbd1d023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #57: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #22: _PyObject_MakeTpCall + 0x26b (0x559db98cca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #23: + 0x150866 (0x559db98df866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #13: + 0x5ae6f10 (0x7fcdd4c74f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank40]: frame #19: + 0xd2631e (0x7f7842d6b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank59]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f3153807978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank44]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559db98c8142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank56]: frame #14: + 0x5ae6fa5 (0x7fcdd4c74fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a30f0ae8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #25: _PyFunction_Vectorcall + 0x6c (0x559db98d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #12: + 0x5adc309 (0x7f318bfc6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #59: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank44]: frame #26: PyObject_Call + 0xbc (0x559db98dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559db98c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fed2ffd4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank56]: frame #15: + 0x5124446 (0x7fcdd42b2446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555eeeb582b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #28: _PyFunction_Vectorcall + 0x6c (0x559db98d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #13: + 0x5ae6f10 (0x7f318bfd0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f0dbd1cac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #60: PyObject_Call + 0xbc (0x55a30f0c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #20: + 0x47def4 (0x7f78424c2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank56]: frame #16: + 0x1acf4b8 (0x7fcdd0c5d4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank35]: dist.recv( [default4]:[rank44]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559db98c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0273ba5fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0dbd1caf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: frame #62: + 0x150582 (0x555eeeb71582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #21: + 0x1445a6 (0x560c5dbe45a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #17: + 0x5aee004 (0x7fcdd4c7c004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a30f0b02b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560c5dbdda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #30: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559db98c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0273b5a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0dbd1cbfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank35]: return func(*args, **kwargs) [default0]:[rank40]: frame #23: + 0x150866 (0x560c5dbf0866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #18: + 0x5af36b5 (0x7fcdd4c816b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #62: + 0x150582 (0x55a30f0c9582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #32: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #14: + 0x5ae6fa5 (0x7f318bfd0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank40]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560c5dbd9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #1: + 0x5b3a23e (0x7fed69af123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank40]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560c5dbe4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #26: PyObject_Call + 0xbc (0x560c5dbf0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fed69aebc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: frame #63: PyObject_Call + 0xbc (0x55a30f0c9f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559db98c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #34: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x559db98c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560c5dbd72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0dbd180371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank34]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank37]: frame #63: PyObject_Call + 0xbc (0x555eeeb71f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank35]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560c5dbe4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #19: + 0xd2631e (0x7fcde786b31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank39]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank40]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560c5dbd58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #30: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0273b5a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0dbd180371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #15: + 0x5124446 (0x7f318b60e446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fed69aebf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fed69aecfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:[rank40]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560c5dbd58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0dbd180371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0273b5a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: dist.recv( [default4]:[rank44]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559db98cbf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #20: + 0x47def4 (0x7fcde6fc2ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank40]: frame #32: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0dbd180371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank44]: frame #37: _PyObject_Call_Prepend + 0x69 (0x559db98ddc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed69aa1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed69aa1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank44]: frame #38: + 0x211239 (0x559db99a0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560c5dbd58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0273b5a371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank37]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank39]: return func(*args, **kwargs) [default4]:[rank44]: frame #39: _PyObject_MakeTpCall + 0x26b (0x559db98cca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559db98c83e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f023b367189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank44]: frame #41: _PyFunction_Vectorcall + 0x6c (0x559db98d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank44]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559db98c3c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: frame #34: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank44]: frame #43: _PyFunction_Vectorcall + 0x6c (0x559db98d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560c5dbd58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560c5dbdcf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed69aa1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #21: + 0x1445a6 (0x55a17b3c15a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank44]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559db98c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #16: + 0x1acf4b8 (0x7f3187fb94b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f023b36e610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank60]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f023b38d978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: dist.recv( [default4]:[rank44]: frame #45: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560c5dbeec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a17b3baa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:[rank35]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3186549897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank44]: frame #46: PyObject_Call + 0xbc (0x559db98dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0d8498d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0d84994610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank40]: frame #38: + 0x211239 (0x560c5dcb1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559db98c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560c5dbdda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560c5dbd93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #17: + 0x5aee004 (0x7f318bfd8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #1: + 0x5b3a23e (0x7f31c006623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #48: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #12: + 0x5adc309 (0x7f0273b4c309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: return func(*args, **kwargs) [default4]:[rank36]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank40]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560c5dbe4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #23: + 0x150866 (0x55a17b3cd866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f31c0060c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560c5dbd4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #18: + 0x5af36b5 (0x7f318bfdd6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank40]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560c5dbe4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560c5dbd58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #45: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f0d849b3978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank40]: frame #46: PyObject_Call + 0xbc (0x560c5dbf0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #49: PyObject_Call + 0xbc (0x559db98dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a17b3b6142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3268fc4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank39]: frame #1: + 0x5b3a23e (0x7f32a2ae123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560c5dbd72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #48: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: Traceback (most recent call last): [default3]:[rank35]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f31c0060f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559db98c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #51: _PyFunction_Vectorcall + 0x6c (0x559db98d3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed69aa1371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #12: + 0x5adc309 (0x7f0dbd172309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #19: + 0xd2631e (0x7f319ebc731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank39]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f32a2adbc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559db98cc007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: Traceback (most recent call last): [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank36]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default0]:[rank40]: frame #49: PyObject_Call + 0xbc (0x560c5dbf0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fed312ae189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:[rank35]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f31c0061fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #53: _PyObject_Call_Prepend + 0x69 (0x559db98ddc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #13: + 0x5ae6f10 (0x7f0dbd17cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank39]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f32a2adbf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560c5dbd72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank60]: frame #13: + 0x5ae6f10 (0x7f0273b56f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31c0016371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560c5dbe4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #54: + 0x211239 (0x559db99a0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fed312b5610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:[rank58]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fed312d4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank39]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f32a2adcfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560c5dbdd007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #14: + 0x5ae6fa5 (0x7f0dbd17cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f127908a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank44]: frame #55: PyObject_Call + 0x207 (0x559db98e0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #20: + 0x47def4 (0x7f319e31eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31c0016371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560c5dbeec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #54: + 0x211239 (0x560c5dcb1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a17b3c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f32a2a91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #55: PyObject_Call + 0x207 (0x560c5dbf1067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559db98c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #57: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #12: + 0x5adc309 (0x7fed69a93309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: trainer.train(dataloader) [default4]:[rank60]: frame #14: + 0x5ae6fa5 (0x7f0273b56fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #15: + 0x5124446 (0x7f0273194446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f32a2a91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31c0016371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #1: + 0x5b3a23e (0x7f12b2ba723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559db98c48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f32a2a91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #59: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f12b2ba1c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #60: PyObject_Call + 0xbc (0x559db98dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank44]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559db98c62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31c0016371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank44]: frame #62: + 0x150582 (0x559db98df582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f32a2a91371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560c5dbd72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f3187823189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: frame #63: PyObject_Call + 0xbc (0x559db98dff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #16: + 0x1acf4b8 (0x7f026fb3f4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f12b2ba1f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #57: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: trainer.train(dataloader) [default7]:[rank39]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f326a29e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank44]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:[rank59]: frame #21: + 0x1445a6 (0x5598b03535a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #26: PyObject_Call + 0xbc (0x55a17b3cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a17b3b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f318782a610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560c5dbd58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #59: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a17b3c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #15: + 0x5124446 (0x7f0dbc7ba446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f326a2a5610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f12b2ba2fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank40]: frame #60: PyObject_Call + 0xbc (0x560c5dbf0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560c5dbd72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: frame #62: + 0x150582 (0x560c5dbf0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5598b034ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f3187849978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:[rank40]: frame #63: PyObject_Call + 0xbc (0x560c5dbf0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank40]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank39]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f326a2c4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: Traceback (most recent call last): [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank60]: frame #17: + 0x5aee004 (0x7f0273b5e004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #12: + 0x5adc309 (0x7f32a2a83309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: trainer.train(dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank60]: frame #18: + 0x5af36b5 (0x7f0273b636b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #13: + 0x5ae6f10 (0x7f32a2a8df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank46]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank63]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank56]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a17b3b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #12: + 0x5adc309 (0x7f31c0008309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #13: + 0x5ae6f10 (0x7f31c0012f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank46]: output = model(**micro_batch) [default5]:[rank61]: frame #16: + 0x1acf4b8 (0x7f0db91654b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f12b2b57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank46]: return forward_call(*args, **kwargs) [default3]:[rank59]: frame #23: + 0x150866 (0x5598b035f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f12b2b57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank46]: sharded_logits = self.model( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: frame #19: + 0xd2631e (0x7f028674d31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank56]: frame #30: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f12b2b57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f12b2b57371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank46]: return forward_call(*args, **kwargs) [default6]:[rank62]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank56]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a17b3b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #32: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f127a364189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank36]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f127a36b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank46]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank46]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank56]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a17b3b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f127a38a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank46]: return self._call_impl(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank59]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5598b0348142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #20: + 0x47def4 (0x7f0285ea4ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank36]: frame #12: + 0x5adc309 (0x7f12b2b49309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #14: + 0x5ae6fa5 (0x7f31c0012fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: return forward_call(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank46]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank58]: frame #13: + 0x5ae6f10 (0x7fed69a9df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #14: + 0x5ae6fa5 (0x7f32a2a8dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #15: + 0x5124446 (0x7f32a20cb446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank46]: pipeline_state.run_communication() [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank59]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5598b0353a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #15: + 0x5124446 (0x7f31bf650446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: recv_activation_tensor = recv_activation() [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank46]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank59]: frame #26: PyObject_Call + 0xbc (0x5598b035ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #13: + 0x5ae6f10 (0x7f12b2b53f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank46]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank59]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5598b03462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #16: + 0x1acf4b8 (0x7f31bbffb4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:[rank46]: dist.recv( [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank39]: frame #16: + 0x1acf4b8 (0x7f329ea764b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: return func(*args, **kwargs) [default6]:[rank46]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank46]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank35]: frame #17: + 0x5aee004 (0x7f31c001a004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #18: + 0x5af36b5 (0x7f31c001f6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank46]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default6]:[rank46]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa4a5db6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank38]: Traceback (most recent call last): [default4]:[rank36]: frame #14: + 0x5ae6fa5 (0x7f12b2b53fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #15: + 0x5124446 (0x7f12b2191446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #16: + 0x1acf4b8 (0x7f12aeb3c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #1: + 0x5b3a23e (0x7fa4df8d323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fa4df8cdc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank35]: frame #19: + 0xd2631e (0x7f31d2c0931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:[rank35]: frame #20: + 0x47def4 (0x7f31d2360ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank38]: trainer.train(dataloader) [default6]:[rank46]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa4df8cdf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa4df8cefd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4df883371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #17: + 0x5aee004 (0x7f32a2a95004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #18: + 0x5af36b5 (0x7f32a2a9a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4df883371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4df883371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa4df883371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa4a7090189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:[rank61]: frame #17: + 0x5aee004 (0x7f0dbd184004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank46]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fa4a7097610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fa4a70b6978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank46]: frame #12: + 0x5adc309 (0x7fa4df875309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #13: + 0x5ae6f10 (0x7fa4df87ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #18: + 0x5af36b5 (0x7f0dbd1896b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank36]: frame #17: + 0x5aee004 (0x7f12b2b5b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #14: + 0x5ae6fa5 (0x7fa4df87ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #15: + 0x5124446 (0x7fa4deebd446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #19: + 0xd2631e (0x7f0dcfd7331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank36]: frame #18: + 0x5af36b5 (0x7f12b2b606b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #16: + 0x1acf4b8 (0x7fa4db8684b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #17: + 0x5aee004 (0x7fa4df887004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank36]: frame #19: + 0xd2631e (0x7f12c574a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank36]: frame #20: + 0x47def4 (0x7f12c4ea1ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #18: + 0x5af36b5 (0x7fa4df88c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank46]: frame #19: + 0xd2631e (0x7fa4f247631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #20: + 0x47def4 (0x7fa4f1bcdef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank60]: frame #21: + 0x1445a6 (0x55b105cfc5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #21: + 0x1445a6 (0x560c0b3a35a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #19: + 0xd2631e (0x7f32b568431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #21: + 0x1445a6 (0x55fdc9de65a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55fdc9ddfa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #23: + 0x150866 (0x55fdc9df2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5598b0353a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank46]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55fdc9ddb142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55fdc9de6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #14: + 0x5ae6fa5 (0x7fed69a9dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #21: + 0x1445a6 (0x5584860395a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #26: PyObject_Call + 0xbc (0x55fdc9df2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55fdc9dd92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55fdc9de6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55fdc9dd78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank39]: frame #20: + 0x47def4 (0x7f32b4ddbef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank46]: frame #30: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55fdc9dd78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #32: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #20: + 0x47def4 (0x7f0dcf4caef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:[rank61]: frame #21: + 0x1445a6 (0x5603881b35a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560c0b39ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55fdc9dd78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #34: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55fdc9dd78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5598b03448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5603881aca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank46]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55fdc9ddef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55fdc9df0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank35]: frame #22: _PyObject_MakeTpCall + 0x26b (0x558486032a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #38: + 0x211239 (0x55fdc9eb3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55fdc9ddfa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55fdc9ddb3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b105cf5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #23: + 0x150866 (0x560c0b3af866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55fdc9de6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55fdc9dd6c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55fdc9de6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank39]: frame #21: + 0x1445a6 (0x56047a8635a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55fdc9dd78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #45: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #46: PyObject_Call + 0xbc (0x55fdc9df2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #15: + 0x5124446 (0x7fed690db446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank39]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56047a85ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55fdc9dd92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #48: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #49: PyObject_Call + 0xbc (0x55fdc9df2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank36]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560c0b398142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55fdc9dd92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55fdc9de6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55fdc9ddf007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #23: + 0x150866 (0x5603881bf866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #23: + 0x150866 (0x558486045866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #23: + 0x150866 (0x56047a86f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank46]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55fdc9df0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #54: + 0x211239 (0x55fdc9eb3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #55: PyObject_Call + 0x207 (0x55fdc9df3067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank36]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560c0b3a3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55fdc9dd92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #57: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #16: + 0x1acf4b8 (0x7fed65a864b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #30: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #26: PyObject_Call + 0xbc (0x560c0b3aff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55fdc9dd78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #59: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #60: PyObject_Call + 0xbc (0x55fdc9df2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5598b03448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5603881a8142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #34: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank46]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55fdc9dd92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #62: + 0x150582 (0x55fdc9df2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: frame #63: PyObject_Call + 0xbc (0x55fdc9df2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank46]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank60]: frame #23: + 0x150866 (0x55b105d08866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank59]: frame #32: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56047a858142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #17: + 0x5aee004 (0x7fed69aa5004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a17b3b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55848602e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56047a863a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #26: PyObject_Call + 0xbc (0x56047a86ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: output = model(**micro_batch) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank35]: frame #25: _PyFunction_Vectorcall + 0x6c (0x558486039a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default6]:[rank62]: output = model(**micro_batch) [default7]:[rank39]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56047a8562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5598b03448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56047a863a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56047a8548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #18: + 0x5af36b5 (0x7fed69aaa6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a17b3b9f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default3]:[rank35]: frame #26: PyObject_Call + 0xbc (0x558486045f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #34: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #30: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #19: + 0xd2631e (0x7fed7c69431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:[rank56]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a17b3cbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560c0b3962b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank35]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55848602c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5598b03448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #28: _PyFunction_Vectorcall + 0x6c (0x558486039a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56047a8548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5603881b3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560c0b3a3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: Traceback (most recent call last): [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank52]: trainer.train(dataloader) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank52]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank52]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560c0b3948fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: frame #32: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #38: + 0x211239 (0x55a17b48e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55848602a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #26: PyObject_Call + 0xbc (0x5603881bff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return forward_call(*args, **kwargs) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank36]: frame #30: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #20: + 0x47def4 (0x7fed7bdebef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:[rank58]: frame #21: + 0x1445a6 (0x55688e8675a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #30: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank52]: output = model(**micro_batch) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank52]: sharded_logits = self.model( [default2]:[rank58]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55688e860a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55848602a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank52]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank52]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-[default3]:[rank59]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5598b034bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560c0b3948fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank38]: sharded_logits = self.model( cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank52]: return forward_call(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank59]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5598b035dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a17b3baa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #32: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56047a8548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank52]: pipeline_state.run_communication() [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank62]: return forward_call(*args, **kwargs) [default7]:[rank39]: frame #34: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: recv_activation_tensor = recv_activation() [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank52]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank60]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b105cf1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560c0b3948fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank52]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank52]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank60]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b105cfca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5603881a62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #34: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]:[rank52]: dist.recv( [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank58]: frame #23: + 0x150866 (0x55688e873866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #32: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: return func(*args, **kwargs) [default4]:[rank52]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank52]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank52]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:[rank52]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default0]:[rank56]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a17b3b63e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56047a8548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f30ce22f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:[rank52]: frame #1: + 0x5b3a23e (0x7f3107d4c23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank39]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56047a85bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f3107d46c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #38: + 0x211239 (0x5598b0420239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default4]:[rank52]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3107d46f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f3107d47fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3107cfc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: frame #26: PyObject_Call + 0xbc (0x55b105d08f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55848602a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3107cfc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3107cfc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3107cfc371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f30cf509189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: sharded_logits = self.model( [default7]:[rank63]: output = model(**micro_batch) [default4]:[rank36]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560c0b3948fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f30cf510610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f30cf52f978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:[rank52]: frame #12: + 0x5adc309 (0x7f3107cee309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5603881b3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560c0b39bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: return forward_call(*args, **kwargs) [default4]:[rank52]: frame #13: + 0x5ae6f10 (0x7f3107cf8f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5603881a48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560c0b3adc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56047a86dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #34: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #14: + 0x5ae6fa5 (0x7f3107cf8fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #15: + 0x5124446 (0x7f3107336446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #16: + 0x1acf4b8 (0x7f3103ce14b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #17: + 0x5aee004 (0x7f3107d00004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank35]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55848602a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x558486031f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #18: + 0x5af36b5 (0x7f3107d056b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank52]: frame #19: + 0xd2631e (0x7f311a8ef31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank52]: frame #20: + 0x47def4 (0x7f311a046ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:[rank36]: frame #38: + 0x211239 (0x560c0b470239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #21: + 0x1445a6 (0x563522f265a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #22: _PyObject_MakeTpCall + 0x26b (0x563522f1fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #23: + 0x150866 (0x563522f32866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #37: _PyObject_Call_Prepend + 0x69 (0x558486043c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: Traceback (most recent call last): [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank47]: trainer.train(dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank47]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank47]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank47]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default4]:[rank52]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563522f1b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #25: _PyFunction_Vectorcall + 0x6c (0x563522f26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #26: PyObject_Call + 0xbc (0x563522f32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563522f192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #28: _PyFunction_Vectorcall + 0x6c (0x563522f26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563522f178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #30: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cl[default7]:[rank39]: frame #38: + 0x211239 (0x56047a930239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56047a85ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) uster/bin/python3.10) [default2]:[rank58]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55688e85c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55688e867a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a17b3c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #38: + 0x211239 (0x558486106239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563522f178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #32: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563522f178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #34: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563522f178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563522f1ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #37: _PyObject_Call_Prepend + 0x69 (0x563522f30c39 in /fsx/ferdinandmom/miniforge3/en[default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward vs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #38: + 0x211239 (0x563522ff3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #39: _PyObject_MakeTpCall + 0x26b (0x563522f1fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563522f1b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #41: _PyFunction_Vectorcall + 0x6c (0x563522f26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563522f16c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #43: _PyFunction_Vectorcall + 0x6c (0x563522f26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563522f178fa in /f[default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560c0b39ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) sx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #45: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default4]:[rank60]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b105cef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #30: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560c0b3983e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560c0b3a3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #46: PyObject_Call + 0xbc (0x563522f32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563522f192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #48: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #49: PyObject_Call + 0xbc (0x563522f32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563522f192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #51: _PyFunction_Vectorcall + 0x6c (0x563522f26a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563522f1f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/[default2]:[rank58]: frame #26: PyObject_Call + 0xbc (0x55688e873f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a17b3b1c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56047a8583e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) bin/python3.10) [default4]:[rank60]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b105cfca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5603881a48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default7]:[rank39]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56047a863a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank52]: frame #53: _PyObject_Call_Prepend + 0x69 (0x563522f30c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #54: + 0x211239 (0x563522ff3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #55: PyObject_Call + 0x207 (0x563522f33067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563522f192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #57: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank35]: frame #39: _PyObject_MakeTpCall + 0x26b (0x558486032a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563522f178fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #59: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #60: PyObject_Call + 0xbc (0x563522f32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563522f192b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a17b3c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55848602e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank47]: output = model(**micro_batch) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank47]: sharded_logits = self.model( [default4]:[rank52]: frame #62: + 0x150582 (0x563522f32582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: frame #63: PyObject_Call + 0xbc (0x563522f32f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank52]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank61]: frame #32: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank47]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank47]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-[default3]:[rank59]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5598b034ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560c0b393c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank47]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: return forward_call(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank47]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank47]: pipeline_state.run_communication() [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communi[default6]:[rank62]: return forward_call(*args, **kwargs) [default4]:[rank36]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560c0b3a3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) cation [default7]:[rank47]: recv_activation_tensor = recv_activation() [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank47]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank47]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank56]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a17b3b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank47]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:[rank47]: dist.recv( [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank47]: return func(*args, **kwargs) [default7]:[rank47]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank47]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank47]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank47]: Exception raised from recvBytes at ../torch/csr[default5]:[rank61]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5603881a48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #41: _PyFunction_Vectorcall + 0x6c (0x558486039a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) c/distributed/c10d/Utils.hpp:672 (most recent call first): [default7]:[rank47]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fed7abbe897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:[rank47]: frame #1: + 0x5b3a23e (0x7fedb46db23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7fedb46d5c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fedb46d5f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fedb46[default3]:[rank59]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5598b03483e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x558486029c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560c0b3948fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) d6fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb468b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb468b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb468b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb468b371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLI[default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank39]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56047a853c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) D(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fed7be98189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fed7be9f610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7fed7bebe978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank47]: frame #12: + 0x5adc309 (0x7fedb467d309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #13: + 0x5ae6f10 (0x7fedb4687f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-[default4]:[rank60]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b105ced8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #34: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #14: + 0x5ae6fa5 (0x7fedb4687fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #15: + 0x5124446 (0x7fedb3cc5446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #16: + 0x1acf4b8 (0x7fedb06704b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #17: + 0x5aee004 (0x7fedb468f004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank36]: frame #45: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #18: + 0x5af36b5 (0x7fedb46946b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank47]: frame #19: + 0xd2631e (0x7fedc727e31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank47]: frame #20: + 0x47def4 (0x7fedc69d5ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank47]: frame #21: + 0x1445a6 (0x560341ea05a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #45: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #30: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #43: _PyFunction_Vectorcall + 0x6c (0x558486039a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560341e99a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #23: + 0x150866 (0x560341eac866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560341e95142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560341ea0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b105ced8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return forward_call(*args, **kwargs) [default6]:[rank38]: return self._call_impl(*args, **kwargs) [default7]:[rank47]: frame #26: PyObject_Call + 0xbc (0x560341eacf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560341e932b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560341ea0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560341e918fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55688e85a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank47]: frame #30: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560341e918fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #32: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560341e918fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55688e867a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5603881a48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55848602a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #34: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560341e918fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560341e98f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank35]: frame #45: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560341eaac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #38: + 0x211239 (0x560341f6d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560341e99a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560341e953e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55688e8588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return forward_call(*args, **kwargs) [default7]:[rank47]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560341ea0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560341e90c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560341ea0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560341e918fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5603881abf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #46: PyObject_Call + 0xbc (0x560c0b3aff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #45: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #46: PyObject_Call + 0xbc (0x560341eacf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560341e932b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #48: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5598b0353a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560c0b3962b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #49: PyObject_Call + 0xbc (0x560341eacf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560341e932b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560341ea0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560341e99007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560341eaac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #54: + 0x211239 (0x560341f6d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #55: PyObject_Call + 0x207 (0x560341ead067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank58]: frame #30: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward /python3.10) [default7]:[rank47]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560341e932b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #57: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560341e918fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #46: PyObject_Call + 0xbc (0x55a17b3cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #46: PyObject_Call + 0xbc (0x558486045f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #59: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #60: PyObject_Call + 0xbc (0x560341eacf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560341e932b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #62: + 0x150582 (0x560341eac582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: frame #63: PyObject_Call + 0xbc (0x560341eacf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank47]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank61]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5603881bdc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55848602c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #48: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55688e8588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56047a863a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a17b3b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56047a8548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank56]: frame #48: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #38: + 0x211239 (0x560388280239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank36]: frame #48: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #32: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #49: PyObject_Call + 0xbc (0x560c0b3aff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: sharded_logits = self.model( [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank35]: frame #49: PyObject_Call + 0xbc (0x558486045f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55848602c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: pipeline_state.run_communication() [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank38]: recv_activation_tensor = recv_activation() [default5]:[rank61]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5603881aca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #45: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #49: PyObject_Call + 0xbc (0x55a17b3cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank60]: frame #32: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #51: _PyFunction_Vectorcall + 0x6c (0x558486039a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5598b0343c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x558486032007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5603881a83e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank56]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a17b3b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #46: PyObject_Call + 0xbc (0x56047a86ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b105ced8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56047a8562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5598b0353a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank61]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5603881b3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #53: _PyObject_Call_Prepend + 0x69 (0x558486043c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55688e8588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #48: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #34: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560c0b3962b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560c0b3a3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5598b03448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #49: PyObject_Call + 0xbc (0x56047a86ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5603881a3c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank58]: frame #34: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank60]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b105ced8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56047a8562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a17b3c1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56047a863a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5603881b3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank58]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55688e8588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #54: + 0x211239 (0x558486106239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #45: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #55: PyObject_Call + 0x207 (0x558486046067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560c0b39c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank36]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560c0b3adc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56047a85c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:[rank61]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5603881a48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: dist.recv( [default2]:[rank58]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55688e85ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56047a86dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #46: PyObject_Call + 0xbc (0x5598b035ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55848602c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank39]: frame #54: + 0x211239 (0x56047a930239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank60]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b105cf4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: return func(*args, **kwargs) [default2]:[rank58]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55688e871c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #57: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5598b03462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55848602a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #59: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #45: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank61]: frame #46: PyObject_Call + 0xbc (0x5603881bff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #54: + 0x211239 (0x560c0b470239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #60: PyObject_Call + 0xbc (0x558486045f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5603881a62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #38: + 0x211239 (0x55688e934239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55848602c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #62: + 0x150582 (0x558486045582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return self._call_impl(*args, **kwargs) [default6]:[rank38]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank56]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a17b3ba007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #55: PyObject_Call + 0x207 (0x56047a870067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: frame #63: PyObject_Call + 0xbc (0x558486045f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #48: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #49: PyObject_Call + 0xbc (0x5603881bff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default6]:[rank38]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default3]:[rank59]: frame #48: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank35]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank60]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b105d06c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #38: + 0x211239 (0x55b105dc9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56047a8562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a17b3cbc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f98fe4b9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank38]: frame #1: + 0x5b3a23e (0x7f9937fd623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #49: PyObject_Call + 0xbc (0x5598b035ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #57: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5603881a62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #55: PyObject_Call + 0x207 (0x560c0b3b0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56047a8548fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f9937fd0c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5603881b3a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #59: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #54: + 0x211239 (0x55a17b48e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5603881ac007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #60: PyObject_Call + 0xbc (0x56047a86ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56047a8562b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5603881bdc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #62: + 0x150582 (0x56047a86f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #54: + 0x211239 (0x560388280239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55688e860a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9937fd0f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5598b03462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: frame #63: PyObject_Call + 0xbc (0x56047a86ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank39]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank60]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b105cf5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9937fd1fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9937f86371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9937f86371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55688e85c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9937f86371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:[rank59]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5598b0353a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9937f86371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #55: PyObject_Call + 0x207 (0x55a17b3ce067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f98ff793189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank38]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f98ff79a610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: return forward_call(*args, **kwargs) [default6]:[rank38]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f98ff7b9978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank38]: frame #12: + 0x5adc309 (0x7f9937f78309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:[rank61]: frame #55: PyObject_Call + 0x207 (0x5603881c0067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b105cf13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #13: + 0x5ae6f10 (0x7f9937f82f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:[rank56]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a17b3b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560c0b3962b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank62]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank36]: frame #57: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560c0b3948fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5603881a62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #59: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #57: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #57: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #60: PyObject_Call + 0xbc (0x560c0b3aff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #14: + 0x5ae6fa5 (0x7f9937f82fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #15: + 0x5124446 (0x7f99375c0446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank36]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560c0b3962b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: frame #62: + 0x150582 (0x560c0b3af582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: pipeline_state.run_communication() [default4]:[rank36]: frame #63: PyObject_Call + 0xbc (0x560c0b3aff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5603881a48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b105cfca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank36]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:[rank58]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55688e867a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #16: + 0x1acf4b8 (0x7f9933f6b4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #17: + 0x5aee004 (0x7f9937f8a004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #18: + 0x5af36b5 (0x7f9937f8f6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank38]: frame #19: + 0xd2631e (0x7f994ab7931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank38]: frame #20: + 0x47def4 (0x7f994a2d0ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank38]: frame #21: + 0x1445a6 (0x561b0091a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561b00913a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a17b3b28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #23: + 0x150866 (0x561b00926866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561b0090f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561b0091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #26: PyObject_Call + 0xbc (0x561b00926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5598b034c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561b0090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561b0091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561b0090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #30: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank38]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561b0090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #32: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561b0090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #34: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561b0090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561b00912f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561b00924c39 in /fsx/ferdinandmom/miniforge3/en[default6]:[rank62]: recv_activation_tensor = recv_activation() [default4]:[rank60]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b105cecc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #59: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) vs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #38: + 0x211239 (0x561b009e7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561b00913a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5598b035dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561b0090f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561b0091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561b0090ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561b0091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561b0090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return forward_call(*args, **kwargs) [default6]:[rank38]: frame #45: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #46: PyObject_Call + 0xbc (0x561b00926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561b0090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #48: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank38]: frame #49: PyObject_Call + 0xbc (0x561b00926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561b0090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561b0091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561b00913007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b105cfca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561b00924c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #54: + 0x211239 (0x561b009e7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #55: PyObject_Call + 0x207 (0x561b00927067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561b0090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank38]: frame #57: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561b0090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #59: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #60: PyObject_Call + 0xbc (0x561b00926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561b0090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #62: + 0x150582 (0x561b00926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank38]: frame #63: PyObject_Call + 0xbc (0x561b00926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/pyt[default6]:[rank62]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] hon3.10) [default6]:[rank38]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:[rank61]: frame #59: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b105ced8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #60: PyObject_Call + 0xbc (0x55a17b3cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank59]: frame #54: + 0x211239 (0x5598b0420239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #45: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a17b3b42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55688e857c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #55: PyObject_Call + 0x207 (0x5598b0360067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #60: PyObject_Call + 0xbc (0x5603881bff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank59]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5598b03462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #62: + 0x150582 (0x55a17b3cd582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55688e867a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank60]: frame #46: PyObject_Call + 0xbc (0x55b105d08f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: frame #63: PyObject_Call + 0xbc (0x55a17b3cdf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #57: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5603881a62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55688e8588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:[rank56]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:[rank63]: return self._call_impl(*args, **kwargs) [default3]:[rank59]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5598b03448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b105cef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #62: + 0x150582 (0x5603881bf582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #45: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #59: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:[rank61]: frame #63: PyObject_Call + 0xbc (0x5603881bff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank59]: frame #60: PyObject_Call + 0xbc (0x5598b035ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5598b03462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #46: PyObject_Call + 0xbc (0x55688e873f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank61]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:[rank60]: frame #48: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55688e85a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #62: + 0x150582 (0x5598b035f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #49: PyObject_Call + 0xbc (0x55b105d08f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #48: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: frame #63: PyObject_Call + 0xbc (0x5598b035ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return forward_call(*args, **kwargs) [default2]:[rank58]: frame #49: PyObject_Call + 0xbc (0x55688e873f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:[rank59]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank60]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b105cef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55688e85a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank58]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55688e867a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank62]: dist.recv( [default2]:[rank58]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55688e860007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank60]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b105cfca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55688e871c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank58]: frame #54: + 0x211239 (0x55688e934239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: pipeline_state.run_communication() [default4]:[rank60]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b105cf5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank63]: recv_activation_tensor = recv_activation() [default4]:[rank60]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b105d06c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank58]: frame #55: PyObject_Call + 0x207 (0x55688e874067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #54: + 0x211239 (0x55b105dc9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55688e85a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #55: PyObject_Call + 0x207 (0x55b105d09067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank60]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b105cef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank63]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank58]: frame #57: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55688e8588fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:[rank58]: frame #59: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: return func(*args, **kwargs) [default4]:[rank60]: frame #57: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: dist.recv( [default2]:[rank58]: frame #60: PyObject_Call + 0xbc (0x55688e873f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:[rank60]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b105ced8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank63]: return func(*args, **kwargs) [default7]:[rank63]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank60]: frame #59: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank60]: frame #60: PyObject_Call + 0xbc (0x55b105d08f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:[rank58]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55688e85a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]:[rank63]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default2]:[rank58]: frame #62: + 0x150582 (0x55688e873582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank60]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b105cef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1798259897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:[rank58]: frame #63: PyObject_Call + 0xbc (0x55688e873f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank60]: frame #62: + 0x150582 (0x55b105d08582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #1: + 0x5b3a23e (0x7f17d1d7623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:[rank58]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank62]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:[rank62]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): [default4]:[rank60]: frame #63: PyObject_Call + 0xbc (0x55b105d08f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f17d1d70c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:[rank60]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank62]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f12cf750897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:[rank62]: frame #1: + 0x5b3a23e (0x7f130926d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f17d1d70f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f17d1d71fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17d1d26371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17d1d26371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17d1d26371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f17d1d26371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1799533189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f179953a610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f1799559978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #12: + 0x5adc309 (0x7f17d1d18309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #13: + 0x5ae6f10 (0x7f17d1d22f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x2c7 (0x7f1309267c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1309267f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #14: + 0x5ae6fa5 (0x7f17d1d22fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1309268fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #15: + 0x5124446 (0x7f17d1360446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #16: + 0x1acf4b8 (0x7f17cdd0b4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #17: + 0x5aee004 (0x7f17d1d2a004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #18: + 0x5af36b5 (0x7f17d1d2f6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #19: + 0xd2631e (0x7f17e491931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank63]: frame #20: + 0x47def4 (0x7f17e4070ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:[rank62]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f130921d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f130921d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:[rank62]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f130921d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #21: + 0x1445a6 (0x560e5b8845a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #22: _PyObject_MakeTpCall + 0x26b (0x560e5b87da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f130921d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #23: + 0x150866 (0x560e5b890866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f12d0a2a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:[rank63]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560e5b879142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f12d0a31610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x5f8 (0x7f12d0a50978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:[rank62]: frame #12: + 0x5adc309 (0x7f130920f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560e5b884a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #26: PyObject_Call + 0xbc (0x560e5b890f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #13: + 0x5ae6f10 (0x7f1309219f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560e5b8772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560e5b884a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #14: + 0x5ae6fa5 (0x7f1309219fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560e5b8758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #15: + 0x5124446 (0x7f1308857446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #30: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #16: + 0x1acf4b8 (0x7f13052024b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560e5b8758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #17: + 0x5aee004 (0x7f1309221004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #32: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #18: + 0x5af36b5 (0x7f13092266b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:[rank63]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560e5b8758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #19: + 0xd2631e (0x7f131be1031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank63]: frame #34: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #20: + 0x47def4 (0x7f131b567ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:[rank63]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x560e5b8758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #21: + 0x1445a6 (0x55c17e9d75a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55c17e9d0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #23: + 0x150866 (0x55c17e9e3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c17e9cc142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55c17e9d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560e5b87cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560e5b88ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #38: + 0x211239 (0x560e5b951239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #26: PyObject_Call + 0xbc (0x55c17e9e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #39: _PyObject_MakeTpCall + 0x26b (0x560e5b87da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c17e9ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55c17e9d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c17e9c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560e5b8793e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560e5b884a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560e5b874c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560e5b884a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560e5b8758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #45: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #46: PyObject_Call + 0xbc (0x560e5b890f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560e5b8772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #30: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c17e9c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #48: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #32: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #49: PyObject_Call + 0xbc (0x560e5b890f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560e5b8772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560e5b884a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560e5b87d007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560e5b88ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #54: + 0x211239 (0x560e5b951239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c17e9c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #55: PyObject_Call + 0x207 (0x560e5b891067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #34: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560e5b8772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c17e9c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c17e9cff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #57: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560e5b8758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #59: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55c17e9e1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #60: PyObject_Call + 0xbc (0x560e5b890f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #38: + 0x211239 (0x55c17eaa4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560e5b8772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55c17e9d0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #62: + 0x150582 (0x560e5b890582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c17e9cc3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: frame #63: PyObject_Call + 0xbc (0x560e5b890f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55c17e9d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:[rank63]: . This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:[rank62]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c17e9c7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55c17e9d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c17e9c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #45: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #46: PyObject_Call + 0xbc (0x55c17e9e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c17e9ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #48: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #49: PyObject_Call + 0xbc (0x55c17e9e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c17e9ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55c17e9d7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c17e9d0007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55c17e9e1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #54: + 0x211239 (0x55c17eaa4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #55: PyObject_Call + 0x207 (0x55c17e9e4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c17e9ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #57: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c17e9c88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #59: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #60: PyObject_Call + 0xbc (0x55c17e9e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c17e9ca2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #62: + 0x150582 (0x55c17e9e3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: frame #63: PyObject_Call + 0xbc (0x55c17e9e3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:[rank62]: . This may indicate a possible application crash on rank 0 or a network set up issue. W0703 01:09:55.583000 140623575447360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1780472 closing signal SIGTERM W0703 01:09:55.583000 140623575447360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1780473 closing signal SIGTERM W0703 01:09:55.584000 140623575447360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1780474 closing signal SIGTERM W0703 01:09:55.584000 140623575447360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1780476 closing signal SIGTERM W0703 01:09:55.584000 140623575447360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1780477 closing signal SIGTERM E0703 01:09:56.704000 140623575447360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1780470) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_01:09:55 host : ip-26-0-160-225.ec2.internal rank : 1 (local_rank: 1) exitcode : 1 (pid: 1780471) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_01:09:55 host : ip-26-0-160-225.ec2.internal rank : 5 (local_rank: 5) exitcode : 1 (pid: 1780475) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_01:09:55 host : ip-26-0-160-225.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 1780470) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-160-225: task 0: Exited with exit code 1 W0703 01:09:59.564000 140352042612480 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-88.ec2.internal_883814_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:00.331000 139837175351040 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-103.ec2.internal_870399_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:00.447000 140002146195200 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-78.ec2.internal_1145753_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:00.451000 140505445222144 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-153.ec2.internal_1421638_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:00.460000 140371262560000 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-138.ec2.internal_666542_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:00.509000 139723741435648 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-62.ec2.internal_3894858_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:00.515000 140132543833856 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-102.ec2.internal_3766278_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:00.588000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766349 closing signal SIGTERM W0703 01:10:00.588000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766351 closing signal SIGTERM W0703 01:10:00.588000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766352 closing signal SIGTERM W0703 01:10:00.590000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766353 closing signal SIGTERM W0703 01:10:00.590000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766354 closing signal SIGTERM W0703 01:10:00.589000 140357703345984 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 883882 closing signal SIGTERM W0703 01:10:00.590000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766355 closing signal SIGTERM W0703 01:10:00.589000 140357703345984 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 883886 closing signal SIGTERM W0703 01:10:00.592000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421707 closing signal SIGTERM W0703 01:10:00.592000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421708 closing signal SIGTERM W0703 01:10:00.592000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421709 closing signal SIGTERM W0703 01:10:00.592000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766356 closing signal SIGTERM W0703 01:10:00.592000 140138204567360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3766357 closing signal SIGTERM W0703 01:10:00.592000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421710 closing signal SIGTERM W0703 01:10:00.593000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421711 closing signal SIGTERM W0703 01:10:00.593000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421712 closing signal SIGTERM W0703 01:10:00.591000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666611 closing signal SIGTERM W0703 01:10:00.592000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666612 closing signal SIGTERM W0703 01:10:00.592000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666613 closing signal SIGTERM W0703 01:10:00.594000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145822 closing signal SIGTERM W0703 01:10:00.594000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145823 closing signal SIGTERM W0703 01:10:00.594000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145824 closing signal SIGTERM W0703 01:10:00.593000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870468 closing signal SIGTERM W0703 01:10:00.594000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145825 closing signal SIGTERM W0703 01:10:00.592000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666614 closing signal SIGTERM W0703 01:10:00.593000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666615 closing signal SIGTERM W0703 01:10:00.593000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666616 closing signal SIGTERM W0703 01:10:00.595000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145826 closing signal SIGTERM W0703 01:10:00.596000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421713 closing signal SIGTERM W0703 01:10:00.593000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870469 closing signal SIGTERM W0703 01:10:00.593000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870470 closing signal SIGTERM W0703 01:10:00.594000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870471 closing signal SIGTERM W0703 01:10:00.597000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145827 closing signal SIGTERM W0703 01:10:00.596000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894927 closing signal SIGTERM W0703 01:10:00.597000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145828 closing signal SIGTERM W0703 01:10:00.597000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894928 closing signal SIGTERM W0703 01:10:00.597000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894929 closing signal SIGTERM W0703 01:10:00.597000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870472 closing signal SIGTERM W0703 01:10:00.596000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666617 closing signal SIGTERM W0703 01:10:00.597000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870473 closing signal SIGTERM W0703 01:10:00.596000 140376923293504 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 666618 closing signal SIGTERM W0703 01:10:00.597000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870474 closing signal SIGTERM W0703 01:10:00.596000 140511105955648 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1421714 closing signal SIGTERM W0703 01:10:00.597000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894930 closing signal SIGTERM W0703 01:10:00.598000 140007806928704 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1145829 closing signal SIGTERM W0703 01:10:00.597000 139842836084544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 870475 closing signal SIGTERM W0703 01:10:00.598000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894931 closing signal SIGTERM W0703 01:10:00.600000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894932 closing signal SIGTERM W0703 01:10:00.600000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894933 closing signal SIGTERM W0703 01:10:00.602000 139729402169152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3894934 closing signal SIGTERM E0703 01:10:01.010000 140357703345984 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 883883) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 01:10:01.015000 140357703345984 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_883814_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:01.042000 140357703345984 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_883814_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:01.066000 140357703345984 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_883814_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_01:10:00 host : ip-26-0-171-88.ec2.internal rank : 58 (local_rank: 2) exitcode : 1 (pid: 883884) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_01:10:00 host : ip-26-0-171-88.ec2.internal rank : 59 (local_rank: 3) exitcode : 1 (pid: 883885) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_01:10:00 host : ip-26-0-171-88.ec2.internal rank : 61 (local_rank: 5) exitcode : 1 (pid: 883887) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_01:10:00 host : ip-26-0-171-88.ec2.internal rank : 62 (local_rank: 6) exitcode : 1 (pid: 883888) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_01:10:00 host : ip-26-0-171-88.ec2.internal rank : 63 (local_rank: 7) exitcode : 1 (pid: 883889) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_01:10:00 host : ip-26-0-171-88.ec2.internal rank : 57 (local_rank: 1) exitcode : 1 (pid: 883883) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-171-88: task 6: Exited with exit code 1 W0703 01:10:05.335000 139837175351040 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-103.ec2.internal_870399_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:05.451000 140002146195200 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-78.ec2.internal_1145753_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:05.455000 140505445222144 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-153.ec2.internal_1421638_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:05.465000 140371262560000 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-138.ec2.internal_666542_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:05.514000 139723741435648 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-62.ec2.internal_3894858_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:05.520000 140132543833856 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-102.ec2.internal_3766278_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:06.933000 139729402169152 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3894858_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:06.948000 139729402169152 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-62.ec2.internal_3894858_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-171-62: task 5: Exited with exit code 1 W0703 01:10:07.624000 140138204567360 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-102.ec2.internal_3766278_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:07.635000 140138204567360 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-102.ec2.internal_3766278_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-171-102: task 7: Exited with exit code 1 W0703 01:10:10.339000 139837175351040 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-103.ec2.internal_870399_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:10.455000 140002146195200 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-78.ec2.internal_1145753_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:10.459000 140505445222144 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-153.ec2.internal_1421638_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:10.469000 140371262560000 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-161-138.ec2.internal_666542_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:13.333000 140007806928704 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-78.ec2.internal_1145753_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:13.343000 140007806928704 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-78.ec2.internal_1145753_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. W0703 01:10:13.537000 140511105955648 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-153.ec2.internal_1421638_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:13.548000 140511105955648 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-153.ec2.internal_1421638_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-161-78: task 1: Exited with exit code 1 srun: error: ip-26-0-161-153: task 4: Exited with exit code 1 W0703 01:10:15.138000 139842836084544 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-103.ec2.internal_870399_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:15.148000 139842836084544 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-103.ec2.internal_870399_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. W0703 01:10:15.238000 140376923293504 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-138.ec2.internal_666542_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 01:10:15.246000 140376923293504 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-161-138.ec2.internal_666542_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-161-103: task 2: Exited with exit code 1 srun: error: ip-26-0-161-138: task 3: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.