========================
START TIME: Wed Jul 3 07:44:04 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0703 07:44:10.899000 140418890327872 torch/distributed/run.py:757]
W0703 07:44:10.899000 140418890327872 torch/distributed/run.py:757] *****************************************
W0703 07:44:10.899000 140418890327872 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:10.899000 140418890327872 torch/distributed/run.py:757] *****************************************
W0703 07:44:10.980000 139721385899840 torch/distributed/run.py:757]
W0703 07:44:10.980000 139721385899840 torch/distributed/run.py:757] *****************************************
W0703 07:44:10.980000 139721385899840 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:10.980000 139721385899840 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.149000 140434965161792 torch/distributed/run.py:757]
W0703 07:44:11.149000 140434965161792 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.149000 140434965161792 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:11.149000 140434965161792 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.162000 140516387739456 torch/distributed/run.py:757]
W0703 07:44:11.162000 140516387739456 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.162000 140516387739456 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:11.162000 140516387739456 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.222000 139708123027264 torch/distributed/run.py:757]
W0703 07:44:11.222000 139708123027264 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.222000 139708123027264 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:11.222000 139708123027264 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.244000 139850146191168 torch/distributed/run.py:757]
W0703 07:44:11.244000 139850146191168 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.244000 139850146191168 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:11.244000 139850146191168 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.326000 140629324416832 torch/distributed/run.py:757]
W0703 07:44:11.326000 140629324416832 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.326000 140629324416832 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:11.326000 140629324416832 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.571000 140479000581952 torch/distributed/run.py:757]
W0703 07:44:11.571000 140479000581952 torch/distributed/run.py:757] *****************************************
W0703 07:44:11.571000 140479000581952 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:44:11.571000 140479000581952 torch/distributed/run.py:757] *****************************************
[default0]:07/03/2024 07:44:36 [WARNING|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Vocab Size Padding] Padded vocab (size: 50257) with 3 dummy tokens (new size: 50260)
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Config:
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Config(general=GeneralArgs(project='bench_cluster',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: run='%date_%jobid',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: seed=42,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: step=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: consumed_train_samples=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: benchmark_csv_path=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: ignore_sanity_checks=True),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: parallelism=ParallelismArgs(dp=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pp=16,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp=4,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pp_engine=,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp_mode=,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp_linear_async_communication=False,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: expert_parallel_size=1),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: eos_token_id=2,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_act='silu',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_size=2048,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: initializer_range=0.02,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: intermediate_size=4096,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: is_llama_config=True,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: max_position_embeddings=4096,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_attention_heads=32,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_hidden_layers=24,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_key_value_heads=32,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pad_token_id=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pretraining_tp=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rms_norm_eps=1e-05,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_scaling=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_theta=10000.0,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tie_word_embeddings=True,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: use_cache=True,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: vocab_size=50260),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: init_method=RandomInit(std=0.025),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dtype=torch.bfloat16,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: make_vocab_size_divisible_by=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: ddp_bucket_cap_mb=25),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer_revision=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer_max_length=None),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoint_interval=100000,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: save_initial_state=False,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: resume_checkpoint_path=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoints_path_is_shared_file_system=False),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: logging=LoggingArgs(log_level='info',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: log_level_replica='info',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: iteration_step_info_interval=1),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokens=TokensArgs(sequence_length=4096,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: train_steps=20,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: micro_batch_size=32,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: batch_accumulation_per_replica=32,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: val_check_interval=-1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: limit_val_batches=0,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: limit_test_batches=0),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: adam_beta1=0.9,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: adam_beta2=0.95,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: torch_adam_is_fused=True,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: name='adamW'),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: zero_stage=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: weight_decay=0.01,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: clip_grad=1.0,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: accumulate_grad_in_fp32=True,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_warmup_steps=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_warmup_style='linear',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_style='linear',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_steps=19,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_starting_step=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: min_decay_lr=1e-05)),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: data_stages=[DatasetStageArgs(name='Training Stage',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: start_training_step=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hf_dataset_splits='train',
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hf_dataset_config_name=None,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dataset_processing_num_proc_per_process=64,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dataset_overwrite_cache=False,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: text_column_name='text'),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: seed=42,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_loading_workers=0))],
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-4_pp-16_mbz-32')),
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lighteval=None)
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Model Config:
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: eos_token_id=2,
[default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_act='silu',
[default0]:07/03/2024 07:44:36
[INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_size=2048, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: initializer_range=0.02, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: intermediate_size=4096, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: is_llama_config=True, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: max_position_embeddings=4096, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_attention_heads=32, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_hidden_layers=24, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_key_value_heads=32, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pad_token_id=None, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pretraining_tp=1, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rms_norm_eps=1e-05, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_scaling=None, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_theta=10000.0, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tie_word_embeddings=True, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: use_cache=True, [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: vocab_size=50260) [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Building model.. [default0]:07/03/2024 07:44:36 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Setting PP block ranks... [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=1|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=1|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=1|ip-26-0-173-202]: No checkpoint path provided. [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=2|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB) [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=2|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=2|ip-26-0-173-202]: No checkpoint path provided. [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=2|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB) [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=2|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=2|ip-26-0-173-202]: No checkpoint path provided. [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=3|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=3|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=3|ip-26-0-173-202]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=0|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=0|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=0|ip-26-0-165-24]: No checkpoint path provided. 
[default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=0|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=0|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=0|ip-26-0-165-24]: No checkpoint path provided. [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=3|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=3|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=3|ip-26-0-173-202]: No checkpoint path provided. [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-7]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-7]: No checkpoint path provided. [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-7]: No checkpoint path provided. [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-7]: No checkpoint path provided. [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-147]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=0|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=0|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. 
Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-233]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Total number of parameters: 1.21G (2313.42MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=0|ip-26-0-173-202]: No checkpoint path provided. [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-147]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB) [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Parametrizing model parameters using StandardParametrizator [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-233]: No checkpoint path provided. [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-147]: No checkpoint path provided. [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=1|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=1|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=9|TP=1|ip-26-0-173-202]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=0|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB) [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=0|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=8|TP=0|ip-26-0-173-202]: No checkpoint path provided. [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=1|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=1|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=1|ip-26-0-173-246]: No checkpoint path provided. [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB) [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-147]: No checkpoint path provided. [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=2|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=2|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=2|ip-26-0-165-24]: No checkpoint path provided. 
[default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB) [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: No checkpoint path provided. [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-233]: No checkpoint path provided. [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=3|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=3|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=3|ip-26-0-165-24]: No checkpoint path provided. [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: No checkpoint path provided. [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB) [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-233]: No checkpoint path provided. [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-147]: No checkpoint path provided. [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=1|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=1|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=6|TP=1|ip-26-0-165-24]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB) [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-147]: No checkpoint path provided. 
[default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=2|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=2|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=2|ip-26-0-165-24]: No checkpoint path provided. [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-147]: No checkpoint path provided. [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB) [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=0|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=0|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=0|ip-26-0-173-246]: No checkpoint path provided. [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=1|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=1|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=1|ip-26-0-173-246]: No checkpoint path provided. [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=2|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB) [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=2|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=2|ip-26-0-173-246]: No checkpoint path provided. [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=3|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=3|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=3|ip-26-0-173-246]: No checkpoint path provided. [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=1|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=1|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=1|ip-26-0-165-24]: No checkpoint path provided. 
[default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: No checkpoint path provided. [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: No checkpoint path provided. [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=2|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB) [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=2|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=2|ip-26-0-173-246]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=0|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB) [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=0|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=10|TP=0|ip-26-0-173-246]: No checkpoint path provided. [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-147]: No checkpoint path provided. [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=3|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=3|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=11|TP=3|ip-26-0-173-246]: No checkpoint path provided. [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-7]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-7]: No checkpoint path provided. 
[default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB) [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: No checkpoint path provided. [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=3|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB) [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=3|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default3]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=3|ip-26-0-164-207]: No checkpoint path provided. [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB) [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default0]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB) [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default1]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-207]: No checkpoint path provided. [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=2|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB) [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=2|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=4|TP=2|ip-26-0-164-207]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB) [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default4]:07/03/2024 07:44:52 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: No checkpoint path provided. [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default2]:07/03/2024 07:44:52 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-7]: No checkpoint path provided. 
[default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=2|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB) [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=2|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default6]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=2|ip-26-0-164-207]: No checkpoint path provided. [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=3|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=3|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=7|TP=3|ip-26-0-165-24]: No checkpoint path provided. [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-7]: No checkpoint path provided. [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB) [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default5]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-207]: No checkpoint path provided. [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=3|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB) [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=3|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default7]:07/03/2024 07:44:52 [INFO|DP=0|PP=5|TP=3|ip-26-0-164-207]: No checkpoint path provided. [default0]:07/03/2024 07:44:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/03/2024 07:44:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/03/2024 07:44:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] DP Rank 0 has 46.7M out of 46.7M (100.00%) params' optimizer states [default0]:07/03/2024 07:44:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/03/2024 07:44:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Using `datasets` library [default0]:07/03/2024 07:44:55 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:55 [WARNING|DP=0|PP=0|TP=0|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. 
[default0]:07/03/2024 07:44:57 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] There are 1 training stages [default0]:07/03/2024 07:44:57 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Stage Training Stage] start from step 1 [default0]:07/03/2024 07:44:57 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [default0]:07/03/2024 07:44:57 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Start training] datetime: 2024-07-03 07:44:57.187851 | mbs: 32 | grad_accum: 32 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/03/2024 07:44:57 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/03/2024 07:44:57 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Memory usage: 448.42MiB. Peak allocated 448.42MiB. Peak reserved: 456.00MiB [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=8|TP=1|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=9|TP=2|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=8|TP=3|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=8|TP=2|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:57 [WARNING|DP=0|PP=6|TP=0|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=7|TP=0|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=9|TP=3|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=13|TP=0|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=13|TP=2|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=12|TP=1|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=12|TP=3|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=9|TP=1|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:57 [WARNING|DP=0|PP=8|TP=0|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=9|TP=0|ip-26-0-173-202]: Repo card metadata block was not found. Setting CardData to empty. 
[default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=1|TP=1|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=1|TP=3|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=2|TP=1|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=3|TP=2|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=3|TP=1|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=10|TP=1|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=6|TP=2|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=2|TP=2|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=0|TP=2|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=1|TP=0|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=1|TP=2|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=3|TP=0|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:57 [WARNING|DP=0|PP=2|TP=0|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=6|TP=3|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=0|TP=1|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=7|TP=2|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=6|TP=1|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=15|TP=2|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=15|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=7|TP=1|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=10|TP=2|ip-26-0-173-246]: Repo card metadata block was not found. 
Setting CardData to empty. [default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=11|TP=1|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=10|TP=3|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=2|TP=3|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=14|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=15|TP=3|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=14|TP=2|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:57 [WARNING|DP=0|PP=10|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=13|TP=1|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=11|TP=3|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=11|TP=2|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=4|TP=3|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=4|TP=2|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 07:44:57 [WARNING|DP=0|PP=4|TP=1|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:57 [WARNING|DP=0|PP=4|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=5|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 07:44:57 [WARNING|DP=0|PP=12|TP=2|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=15|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 07:44:57 [WARNING|DP=0|PP=5|TP=1|ip-26-0-164-207]: Repo card metadata block was not found. 
Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=3|TP=3|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=0|TP=3|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:57 [WARNING|DP=0|PP=12|TP=0|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 07:44:57 [WARNING|DP=0|PP=14|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=7|TP=3|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=13|TP=3|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 07:44:57 [WARNING|DP=0|PP=5|TP=2|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 07:44:57 [WARNING|DP=0|PP=5|TP=3|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. 
Setting CardData to empty.
[default7]:Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/03/2024 07:44:57 [WARNING|DP=0|PP=11|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/03/2024 07:44:57 [WARNING|DP=0|PP=14|TP=3|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/03/2024 07:45:07 [WARNING|DP=0|PP=4|TP=2|ip-26-0-164-207]: Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub
[default2]:07/03/2024 07:45:07 [WARNING|DP=0|PP=4|TP=2|ip-26-0-164-207]: Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024).
[default2]:Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub
[default2]:Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024).
[default3]:[rank3]: Traceback (most recent call last):
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default3]:[rank3]: trainer.train(dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter(
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default3]:[rank3]: output = model(**micro_batch)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default3]:[rank3]: sharded_logits = self.model(
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
**kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank3]: output = self.pp_block(**new_kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default3]:[rank3]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default3]:[rank3]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 128, in forward [default3]:[rank3]: return self.act(gate_states) * up_states [default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU  has a total capacity of 79.33 GiB of which 109.94 MiB is free. Including non-PyTorch memory, this process has 79.21 GiB memory in use. Of the allocated memory 69.32 GiB is allocated by PyTorch, and 13.43 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank0]: Traceback (most recent call last): [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: trainer.train(dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: output = model(**micro_batch) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) 
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default0]:[rank0]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default0]:[rank0]: merged_states = self.gate_up_proj(hidden_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default0]:[rank0]: return column_linear( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default0]:[rank0]: return F.linear(input, weight, bias) [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. 
GPU [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) 
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default1]:[rank1]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default1]:[rank1]: merged_states = self.gate_up_proj(hidden_states) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default1]:[rank1]: return column_linear( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default1]:[rank1]: return F.linear(input, weight, bias) [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU  has a total capacity of 79.33 GiB of which 381.94 MiB is free. Including non-PyTorch memory, this process has 78.95 GiB memory in use. Of the allocated memory 68.57 GiB is allocated by PyTorch, and 269.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank2]: output = model(**micro_batch) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: sharded_logits = self.model( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default2]:[rank2]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default2]:[rank2]: merged_states = self.gate_up_proj(hidden_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default2]:[rank2]: return column_linear( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default2]:[rank2]: return F.linear(input, weight, bias) [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU  has a total capacity of 79.33 GiB of which 453.94 MiB is free. Including non-PyTorch memory, this process has 78.88 GiB memory in use. Of the allocated memory 68.57 GiB is allocated by PyTorch, and 269.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. 
(Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:135.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) 
[default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). 
If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. 
This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [default4]:[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600084 milliseconds before timing out. [default6]:[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600082 milliseconds before timing out. [default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. [default5]:[rank5]: Traceback (most recent call last): [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank5]: pipeline_state.run_communication() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank5]: recv_activation_tensor = recv_activation() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank5]: dist.recv( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank5]: return func(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank5]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank7]: 
pipeline_state.run_communication() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank7]: recv_activation_tensor = recv_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank7]: dist.recv( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank7]: return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank4]: pipeline_state.run_communication() [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank4]: recv_activation_tensor = recv_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank4]: dist.recv( [default6]:[rank6]: pipeline_state.run_communication() [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank4]: return func(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank6]: recv_activation_tensor = 
recv_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank4]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank6]: dist.recv( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank6]: return func(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank6]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600024 milliseconds before timing out. 
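The watchdog messages report P2P RECV operations that exceeded their 600000 ms (10 minute) limit, which is expected once the sending stage has already crashed with the OOM above. For reference, torch.distributed accepts an explicit timeout when the process group is created; the sketch below is a generic illustration of that argument only (nanotron builds its process groups internally, so this is not its actual setup code), and raising the limit merely postpones the abort instead of fixing the OOM.

# Generic torch.distributed sketch: pass an explicit NCCL timeout at init time.
# Assumes the usual torchrun environment variables (RANK, WORLD_SIZE,
# MASTER_ADDR, MASTER_PORT) are already set by the launcher.
import datetime
import torch.distributed as dist

def init_process_group_with_timeout(minutes: int = 30) -> None:
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=minutes),  # per-op watchdog limit
    )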
[default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: trainer.train(dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank11]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank11]: trainer.train(dataloader) 
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: pipeline_state.run_communication() [default3]:[rank11]: output = model(**micro_batch) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank9]: recv_activation_tensor = recv_activation() [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank9]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank9]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank11]: sharded_logits = self.model( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank9]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank11]: return 
self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank11]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank11]: pipeline_state.run_communication() [default1]:[rank9]: dist.recv( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank11]: recv_activation_tensor = recv_activation() [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank11]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank11]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank11]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank11]: dist.recv( [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank11]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank9]: return func(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank11]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank11]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default1]:[rank9]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank9]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default2]:[rank10]: Traceback (most recent call last): [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank10]: output = model(**micro_batch) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank10]: sharded_logits = self.model( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank10]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default2]:[rank10]: pipeline_state.run_communication() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank10]: recv_activation_tensor = recv_activation() [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank10]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank10]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank10]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default2]:[rank10]: dist.recv( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank10]: return func(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank10]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank10]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank8]: output = model(**micro_batch) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank8]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank8]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank8]: pipeline_state.run_communication() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank8]: recv_activation_tensor = recv_activation() [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank8]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank8]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank8]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank8]: dist.recv( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank8]: return func(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default0]:[rank8]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:[rank8]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default4]:[rank12]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. [default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: return forward_call(*args, **kwargs) [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: output = model(**micro_batch) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: sharded_logits = self.model( [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: sharded_logits = self.model( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward 
[default7]:[rank15]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank15]: pipeline_state.run_communication() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank15]: recv_activation_tensor = recv_activation() [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank15]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank15]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank15]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank15]: dist.recv( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank15]: return func(*args, **kwargs) [default6]:[rank14]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank14]: pipeline_state.run_communication() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank14]: recv_activation_tensor = recv_activation() [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank14]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank14]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank14]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank15]: pg.recv([tensor], 
group_src_rank, tag).wait() [default7]:[rank15]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank14]: dist.recv( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank14]: return func(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank14]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank14]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default4]:[rank12]: Traceback (most recent call last): [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: trainer.train(dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank12]: output = model(**micro_batch) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: sharded_logits = self.model( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: 
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank12]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank12]: pipeline_state.run_communication() [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank12]: recv_activation_tensor = recv_activation() [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank12]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank12]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank12]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank12]: dist.recv( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank12]: return func(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank12]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank12]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default5]:[rank13]: Traceback (most recent call last): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank13]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: output = model(**micro_batch) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank13]: sharded_logits = self.model( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank13]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default5]:[rank13]: pipeline_state.run_communication() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank13]: recv_activation_tensor = recv_activation() [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank13]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank13]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank13]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank13]: dist.recv( [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank13]: return func(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank13]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank13]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6b1e298897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6b1f571c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6b1f576a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6b1f577dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f6b6b010e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f6b70057609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f6b6fe22353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6b1e298897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f6b1f571c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6b1f576a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6b1f577dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7f6b6b010e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7f6b70057609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7f6b6fe22353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6b1e298897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7f6b1f1fb119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7f6b6b010e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 
(0x7f6b70057609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7f6b6fe22353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default0]:[rank16]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. [default1]:[rank17]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600053 milliseconds before timing out. [default2]:[rank18]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600020 milliseconds before timing out. [default3]:[rank19]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600038 milliseconds before timing out. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600024 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84f74c6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f84f879fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f84f87a4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84f87a5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f854423ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f8549285609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f8549050353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600024 milliseconds before timing out. 
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84f74c6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f84f879fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f84f87a4a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84f87a5dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f854423ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f8549285609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f8549050353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84f74c6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f84f8429119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f854423ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 (0x7f8549285609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f8549050353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default2]:[rank18]: Traceback (most recent call last): [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank18]: trainer.train(dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank18]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank18]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank18]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank18]: output = model(**micro_batch) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank18]: sharded_logits = self.model( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank18]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank18]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank18]: return self._call_impl(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank18]: return forward_call(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]:[rank18]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:[rank18]: pipeline_state.run_communication() [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:[rank18]: recv_activation_tensor = recv_activation() [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:[rank18]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:[rank18]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:[rank18]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default2]:[rank18]: dist.recv( [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper 
[default2]:[rank18]: return func(*args, **kwargs) [default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default2]:[rank18]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:[rank18]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default3]:[rank19]: Traceback (most recent call last): [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank19]: trainer.train(dataloader) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank19]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank19]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank19]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank19]: output = model(**micro_batch) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank19]: sharded_logits = self.model( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank19]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank19]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank19]: return self._call_impl(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank19]: return forward_call(*args, **kwargs) [default3]:[rank19]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:[rank19]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:[rank19]: pipeline_state.run_communication() [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:[rank19]: recv_activation_tensor = recv_activation() [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:[rank19]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:[rank19]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:[rank19]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]:[rank19]: dist.recv( [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank19]: return func(*args, **kwargs) [default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default3]:[rank19]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:[rank19]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default1]:[rank17]: Traceback (most recent call last): [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank17]: trainer.train(dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank17]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank17]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank17]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank17]: output = model(**micro_batch) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank17]: sharded_logits = self.model( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank17]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank17]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank17]: return self._call_impl(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank17]: return forward_call(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:[rank17]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default1]:[rank17]: pipeline_state.run_communication() [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:[rank17]: recv_activation_tensor = recv_activation() [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:[rank17]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:[rank17]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:[rank17]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]:[rank17]: dist.recv( [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank17]: return func(*args, **kwargs) [default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default1]:[rank17]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:[rank17]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default6]:[rank14]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. 
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48256a6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f482697fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4826984a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4826985dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f487241ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f4877465609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f4877230353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:terminate called after throwing an instance of 'c10::DistBackendError' [default6]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. [default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48256a6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f482697fc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4826984a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4826985dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #4: + 0xd3e95 (0x7f487241ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #5: + 0x8609 (0x7f4877465609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #6: clone + 0x43 (0x7f4877230353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48256a6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0xe32119 (0x7f4826609119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #2: + 0xd3e95 (0x7f487241ee95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default6]:frame #3: + 0x8609 
(0x7f4877465609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default6]:frame #4: clone + 0x43 (0x7f4877230353 in /lib/x86_64-linux-gnu/libc.so.6) [default6]: [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. [default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcecab9a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fcecbe73c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fcecbe78a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fcecbe79dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fcf17912e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fcf1c959609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fcf1c724353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:terminate called after throwing an instance of 'c10::DistBackendError' [default7]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcecab9a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fcecbe73c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fcecbe78a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fcecbe79dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: + 0xd3e95 (0x7fcf17912e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #5: + 0x8609 (0x7fcf1c959609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #6: clone + 0x43 (0x7fcf1c724353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: [default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcecab9a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0xe32119 (0x7fcecbafd119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: + 0xd3e95 (0x7fcf17912e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default7]:frame #3: + 0x8609 (0x7fcf1c959609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default7]:frame #4: clone + 0x43 (0x7fcf1c724353 in /lib/x86_64-linux-gnu/libc.so.6) [default7]: E0703 07:55:53.524000 140418890327872 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1709019) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED 
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1709020)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1709021)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1709022)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1709023)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1709024)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1709025)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1709026)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_07:55:53
  host      : ip-26-0-162-233.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1709019)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122.
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[default1]:[rank9]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc82b18f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc82c468c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc82c46da80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc82c46edcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7fc877f07e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7fc87cf4e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7fc87cd19353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc82b18f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fc82c468c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc82c46da80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc82c46edcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7fc877f07e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7fc87cf4e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7fc87cd19353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc82b18f897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7fc82c0f2119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7fc877f07e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 
(0x7fc87cf4e609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7fc87cd19353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feeea2f2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7feeeb5cbc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7feeeb5d0a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7feeeb5d1dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7fef3706ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7fef3c0b1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7fef3be7c353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feeea2f2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7feeeb5cbc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7feeeb5d0a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7feeeb5d1dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7fef3706ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7fef3c0b1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7fef3be7c353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feeea2f2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7feeeb255119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7fef3706ae95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7fef3c0b1609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7fef3be7c353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: srun: error: ip-26-0-162-233: task 0: Exited with exit code 1 [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default4]:[rank12]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. 
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f589c525897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f589d7fec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f589d803a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f589d804dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f58e929de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f58ee2e4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f58ee0af353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:terminate called after throwing an instance of 'c10::DistBackendError' [default4]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. [default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f589c525897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f589d7fec62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f589d803a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f589d804dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #4: + 0xd3e95 (0x7f58e929de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #5: + 0x8609 (0x7f58ee2e4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #6: clone + 0x43 (0x7f58ee0af353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f589c525897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0xe32119 (0x7f589d488119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #2: + 0xd3e95 (0x7f58e929de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default4]:frame #3: + 0x8609 
(0x7f58ee2e4609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default4]:frame #4: clone + 0x43 (0x7f58ee0af353 in /lib/x86_64-linux-gnu/libc.so.6) [default4]: [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 123, last enqueued NCCL work: 123, last completed NCCL work: 122. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default5]:[rank13]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. [default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f503da54897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f503ed2dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f503ed32a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f503ed33dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f508a7cce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f508f813609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f508f5de353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:terminate called after throwing an instance of 'c10::DistBackendError' [default5]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. 
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f503da54897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f503ed2dc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f503ed32a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f503ed33dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #4: + 0xd3e95 (0x7f508a7cce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #5: + 0x8609 (0x7f508f813609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #6: clone + 0x43 (0x7f508f5de353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f503da54897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #1: + 0xe32119 (0x7f503e9b7119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: + 0xd3e95 (0x7f508a7cce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default5]:frame #3: + 0x8609 (0x7f508f813609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default5]:frame #4: clone + 0x43 (0x7f508f5de353 in /lib/x86_64-linux-gnu/libc.so.6) [default5]: [default0]:[rank16]: Traceback (most recent call last): [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank16]: trainer.train(dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank16]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank16]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank16]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank16]: output = model(**micro_batch) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank16]: sharded_logits = self.model( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank16]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank16]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank16]: return self._call_impl(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank16]: return forward_call(*args, **kwargs) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:[rank16]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:[rank16]: pipeline_state.run_communication() [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:[rank16]: recv_activation_tensor = recv_activation() [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:[rank16]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:[rank16]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:[rank16]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]:[rank16]: dist.recv( [default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper 
[default0]:[rank16]: return func(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default0]:[rank16]: pg.recv([tensor], group_src_rank, tag).wait()
[default0]:[rank16]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default5]:[rank21]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[default4]:[rank20]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[default6]:[rank22]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600005 milliseconds before timing out.
[default0]:[rank32]:[E ProcessGroupNCCL.cpp:563] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=95, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600048 milliseconds before timing out.
[default4]:[rank36]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600044 milliseconds before timing out.
[default5]:[rank37]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600045 milliseconds before timing out.
[default2]:[rank34]:[E ProcessGroupNCCL.cpp:563] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=95, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[default4]:[rank28]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[default3]:[rank27]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=119, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600042 milliseconds before timing out.
[default5]:[rank29]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[default1]:[rank25]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=119, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
[default2]:[rank26]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=119, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600057 milliseconds before timing out.
[default1]:[rank41]:[E ProcessGroupNCCL.cpp:563] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=71, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
[default7]:[rank31]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[default6]:[rank30]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[default2]:[rank42]:[E ProcessGroupNCCL.cpp:563] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=71, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
[default0]:[rank40]:[E ProcessGroupNCCL.cpp:563] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=71, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
[default3]:[rank43]:[E ProcessGroupNCCL.cpp:563] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=71, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
[default7]:[rank39]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
[default7]:[rank23]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=123, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
W0703 07:55:57.351000 140473339848448 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_453711_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[default3]:[rank35]:[E ProcessGroupNCCL.cpp:563] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=95, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600048 milliseconds before timing out.
[default6]:[rank38]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82, OpType=SEND, NumelIn=6, NumelOut=6, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
[default1]:[rank33]:[E ProcessGroupNCCL.cpp:563] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=95, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600051 milliseconds before timing out.
[default0]:[rank24]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=119, OpType=SEND, NumelIn=67108864, NumelOut=67108864, Timeout(ms)=600000) ran for 600049 milliseconds before timing out.
W0703 07:55:57.630000 139715725166336 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_369706_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:55:57.706000 140429304428288 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-7.ec2.internal_2075126_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
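Every watchdog entry above reports the same thing: a point-to-point SEND or RECV that sat for the full Timeout(ms)=600000 before the process group was torn down, with some ranks blocked receiving the small 7-element metadata tensor and their peers blocked sending. The snippet below is a hedged sketch of where that timeout is set when a process group is created with init_process_group; raising it only delays the watchdog and does not fix a stalled pipeline schedule. nanotron builds its process groups internally, so this is illustrative rather than a drop-in patch, and it assumes the usual torchrun environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are present.

# Hedged sketch: the collective timeout (Timeout(ms)=600000 in the log) is an
# argument of init_process_group. This assumes launch under torchrun with the
# standard env:// rendezvous variables already set.
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # larger than the 600000 ms seen above
)

A longer timeout mainly keeps slow but healthy jobs alive; here every stage was blocked for the full ten minutes, which points at mismatched send/recv pairs in the pipeline schedule rather than a merely slow collective.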
[default3]:[rank27]: Traceback (most recent call last): [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank27]: trainer.train(dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank27]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank27]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank27]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank27]: grad_accumulator.backward(sum(activations)) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank27]: result = loss.backward() [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank27]: torch.autograd.backward( [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank27]: _engine_run_backward( [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank27]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank27]: return user_fn(self, *args) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank27]: pipeline_state.run_communication() [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:[rank27]: send_activation() [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank27]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:[rank27]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:[rank27]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default3]:[rank27]: dist.send( [default3]:[rank27]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank27]: return func(*args, **kwargs) [default3]:[rank27]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default3]:[rank27]: group.send([tensor], group_dst_rank, tag).wait() [default3]:[rank27]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank24]: Traceback (most recent call last): [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank24]: trainer.train(dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank24]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank24]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank24]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank24]: grad_accumulator.backward(sum(activations)) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank24]: result = loss.backward() [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank24]: torch.autograd.backward( [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank24]: _engine_run_backward( [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank24]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank24]: return user_fn(self, *args) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank24]: pipeline_state.run_communication() [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank24]: send_activation() [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default0]:[rank24]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors 
[default0]:[rank24]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default0]:[rank24]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank24]: dist.send( [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank24]: return func(*args, **kwargs) [default0]:[rank24]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default0]:[rank24]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank24]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank25]: Traceback (most recent call last): [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank25]: trainer.train(dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank25]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank25]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank25]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank25]: grad_accumulator.backward(sum(activations)) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank25]: result = loss.backward() [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank25]: torch.autograd.backward( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank25]: _engine_run_backward( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank25]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank25]: return user_fn(self, *args) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank25]: pipeline_state.run_communication() [default1]:[rank25]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank25]: send_activation() [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank25]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank25]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank25]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank25]: dist.send( [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank25]: return func(*args, **kwargs) [default1]:[rank25]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank25]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank25]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank26]: Traceback (most recent call last): [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank26]: trainer.train(dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank26]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank26]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank26]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank26]: grad_accumulator.backward(sum(activations)) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank26]: result = loss.backward() [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank26]: torch.autograd.backward( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank26]: _engine_run_backward( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank26]: return 
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank26]: return user_fn(self, *args) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank26]: pipeline_state.run_communication() [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank26]: send_activation() [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank26]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank26]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank26]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank26]: dist.send( [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank26]: return func(*args, **kwargs) [default2]:[rank26]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank26]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank26]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. 
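
All of the send-side tracebacks above end in the same chain: p2p.send_tensors -> isend_tensors -> _send_meta -> dist.send. A small metadata message is sent ahead of the activation payload, and it is that metadata send which raises torch.distributed.DistBackendError once the NCCL communicator has been aborted on a peer. A minimal sketch of this metadata-then-payload pattern follows; the helper name, the MAX_DIMS padding and the metadata layout are illustrative assumptions, not nanotron's actual wire format.

# Minimal sketch of a metadata-then-payload point-to-point send, mirroring the
# send_tensors -> _send_meta -> dist.send chain in the tracebacks above.
# MAX_DIMS and the metadata layout are assumptions for illustration only.
import torch
import torch.distributed as dist

MAX_DIMS = 8  # assumed fixed metadata width, so the receiver can pre-allocate

def send_tensor_with_meta(tensor: torch.Tensor, to_rank: int, tag: int = 0) -> None:
    assert tensor.ndim <= MAX_DIMS
    # NCCL requires CUDA tensors, so the metadata lives on the same device as the payload.
    shape = list(tensor.shape) + [0] * (MAX_DIMS - tensor.ndim)
    meta = torch.tensor([tensor.ndim, *shape], dtype=torch.long, device=tensor.device)
    # If the peer's NCCL communicator has already been aborted, this send is the
    # call that surfaces DistBackendError, as in the logs above.
    dist.send(meta, dst=to_rank, tag=tag)
    dist.send(tensor.contiguous(), dst=to_rank, tag=tag)

Sending the shape ahead of the data lets the receiving stage allocate a correctly sized buffer before the payload arrives.
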
[default5]:[rank21]: Traceback (most recent call last): [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank21]: trainer.train(dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank21]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank21]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default5]:[rank21]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank21]: output = model(**micro_batch) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank21]: sharded_logits = self.model( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank21]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank21]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank21]: return self._call_impl(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank21]: return forward_call(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank21]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default5]:[rank21]: pipeline_state.run_communication() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank21]: recv_activation_tensor = recv_activation() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank21]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank21]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank21]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank21]: dist.recv( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank21]: return func(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank21]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank21]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default4]:[rank20]: Traceback (most recent call last): [default6]:[rank22]: Traceback (most recent call last): [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank22]: trainer.train(dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank22]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default6]:[rank22]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank22]: output = model(**micro_batch) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank20]: trainer.train(dataloader) [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank20]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank22]: sharded_logits = self.model( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank22]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank20]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank22]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default4]:[rank20]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank22]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank20]: output = model(**micro_batch) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank22]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank22]: pipeline_state.run_communication() [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank22]: 
recv_activation_tensor = recv_activation() [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank20]: sharded_logits = self.model( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank20]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank20]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank20]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank20]: return forward_call(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank20]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank20]: pipeline_state.run_communication() [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank22]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank20]: recv_activation_tensor = recv_activation() [default6]:[rank22]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank22]: dist.recv( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank20]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank20]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank22]: return func(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank22]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank22]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank20]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank20]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank20]: dist.recv( [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank20]: return func(*args, **kwargs) [default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank20]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank20]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. W0703 07:55:57.935000 140510727005952 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-202.ec2.internal_1351235_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 07:55:57.943000 139702462293760 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_944354_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 07:55:57.966000 139844485457664 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-174-36.ec2.internal_884473_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. 
[default7]:[rank23]: Traceback (most recent call last): [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank23]: trainer.train(dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank23]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank23]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter [default7]:[rank23]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank23]: output = model(**micro_batch) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank23]: sharded_logits = self.model( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank23]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank23]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank23]: return self._call_impl(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank23]: return forward_call(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank23]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer 
[default7]:[rank23]: pipeline_state.run_communication() [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank23]: recv_activation_tensor = recv_activation() [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank23]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank23]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank23]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank23]: dist.recv( [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank23]: return func(*args, **kwargs) [default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank23]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank23]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. W0703 07:55:58.097000 140623663683328 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-163-147.ec2.internal_840971_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. 
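
The receive-side tracebacks (ranks 20, 21, 22 and 23 above) fail in the mirror-image chain: recv_from_pipeline_state_buffer -> recv_tensors -> irecv_tensors -> _recv_meta -> dist.recv. The stage blocks waiting for the peer's metadata, and the pending recv is torn down together with the communicator. A matching sketch under the same assumed metadata layout as the send sketch above; the float32 dtype is likewise an assumption.

# Counterpart to the recv-side tracebacks: receive the metadata first, allocate a
# buffer from it, then receive the payload. Layout matches the send sketch above.
import torch
import torch.distributed as dist

MAX_DIMS = 8  # must agree with the sender's assumed metadata width

def recv_tensor_with_meta(from_rank: int, tag: int = 0,
                          device: torch.device = torch.device("cuda")) -> torch.Tensor:
    meta = torch.empty(1 + MAX_DIMS, dtype=torch.long, device=device)
    # This is the blocking recv that raises DistBackendError once the NCCL
    # communicator is aborted on the sending rank.
    dist.recv(meta, src=from_rank, tag=tag)
    ndim = int(meta[0].item())
    shape = [int(s) for s in meta[1:1 + ndim].tolist()]
    out = torch.empty(shape, dtype=torch.float32, device=device)  # dtype assumed
    dist.recv(out, src=from_rank, tag=tag)
    return out
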
[default3]:[rank35]: Traceback (most recent call last): [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank35]: trainer.train(dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank35]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank35]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default3]:[rank35]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank35]: grad_accumulator.backward(sum(activations)) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default3]:[rank35]: result = loss.backward() [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default3]:[rank35]: torch.autograd.backward( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default3]:[rank35]: _engine_run_backward( [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default3]:[rank35]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default3]:[rank35]: return user_fn(self, *args) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default3]:[rank35]: pipeline_state.run_communication() [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:[rank35]: send_activation() [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank35]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default3]:[rank35]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default3]:[rank35]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default3]:[rank35]: dist.send( [default3]:[rank35]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank35]: return func(*args, **kwargs) [default3]:[rank35]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default3]:[rank35]: group.send([tensor], group_dst_rank, tag).wait() [default3]:[rank35]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default1]:[rank33]: Traceback (most recent call last): [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank33]: trainer.train(dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank33]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank33]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank33]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank33]: grad_accumulator.backward(sum(activations)) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank33]: result = loss.backward() [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank33]: torch.autograd.backward( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank33]: _engine_run_backward( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank33]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank33]: return user_fn(self, *args) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default1]:[rank33]: pipeline_state.run_communication() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank33]: send_activation() [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank33]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors 
[default1]:[rank33]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default1]:[rank33]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default1]:[rank33]: dist.send( [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default1]:[rank33]: return func(*args, **kwargs) [default1]:[rank33]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default1]:[rank33]: group.send([tensor], group_dst_rank, tag).wait() [default1]:[rank33]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default0]:[rank32]: Traceback (most recent call last): [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank32]: trainer.train(dataloader) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank32]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank32]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default0]:[rank32]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default0]:[rank32]: grad_accumulator.backward(sum(activations)) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank32]: result = loss.backward() [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default0]:[rank32]: torch.autograd.backward( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default0]:[rank32]: _engine_run_backward( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank32]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default0]:[rank32]: return user_fn(self, *args) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default0]:[rank32]: pipeline_state.run_communication() [default0]:[rank32]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default0]:[rank32]: send_activation() [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default0]:[rank32]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default0]:[rank32]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default0]:[rank32]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default0]:[rank32]: dist.send( [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default0]:[rank32]: return func(*args, **kwargs) [default0]:[rank32]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default0]:[rank32]: group.send([tensor], group_dst_rank, tag).wait() [default0]:[rank32]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank34]: Traceback (most recent call last): [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank34]: trainer.train(dataloader) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank34]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank34]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank34]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank34]: grad_accumulator.backward(sum(activations)) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank34]: result = loss.backward() [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank34]: torch.autograd.backward( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank34]: _engine_run_backward( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default2]:[rank34]: return 
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank34]: return user_fn(self, *args) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank34]: pipeline_state.run_communication() [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default2]:[rank34]: send_activation() [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default2]:[rank34]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default2]:[rank34]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank34]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank34]: dist.send( [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default2]:[rank34]: return func(*args, **kwargs) [default2]:[rank34]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default2]:[rank34]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank34]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. 
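
The sender tracebacks also show where the communication is triggered from: inside the autograd engine (torch/autograd/function.py apply -> nanotron functional.py backward -> pipeline_state.run_communication()). In other words, the activation send is scheduled by a custom autograd Function's backward, which is why the NCCL abort surfaces under loss.backward(). A rough sketch of that pattern; the PipelineState stand-in and the callback wiring are illustrative assumptions, not nanotron's code.

# Sketch of hooking pipeline communication into autograd, mirroring the
# function.py -> functional.py backward -> run_communication() frames above.
import torch

class PipelineState:
    def __init__(self):
        self._callbacks = []

    def register(self, fn):
        self._callbacks.append(fn)

    def run_communication(self):
        # In the real engine this issues the p2p sends/recvs; here we just drain callbacks.
        while self._callbacks:
            self._callbacks.pop(0)()

class SendActivationInBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, activation: torch.Tensor, state: PipelineState):
        ctx.state = state
        return activation

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Communication runs from inside the autograd engine, so a failed NCCL
        # send/recv propagates up through loss.backward(), as in the logs above.
        ctx.state.run_communication()
        return grad_output, None

# Usage (illustrative): y = SendActivationInBackward.apply(x, state)
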
[default1]:[rank41]: Traceback (most recent call last): [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: trainer.train(dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: Traceback (most recent call last): [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank41]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank41]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default1]:[rank41]: grad_accumulator.backward(sum(activations)) [default2]:[rank42]: trainer.train(dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default1]:[rank41]: result = loss.backward() [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default1]:[rank41]: torch.autograd.backward( [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default1]:[rank41]: _engine_run_backward( [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default1]:[rank41]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:[rank42]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default1]:[rank41]: return user_fn(self, *args) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank42]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default2]:[rank42]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default2]:[rank42]: grad_accumulator.backward(sum(activations)) [default2]:[rank42]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default2]:[rank42]: result = loss.backward() [default1]:[rank41]: pipeline_state.run_communication() [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default1]:[rank41]: send_activation() [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default1]:[rank41]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors [default1]:[rank41]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward [default2]:[rank42]: torch.autograd.backward( [default3]:[rank43]: Traceback (most recent call last): [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward [default2]:[rank42]: _engine_run_backward( [default3]:[rank43]: trainer.train(dataloader) [default1]:[rank41]: self._send_meta(tensor, to_rank=to_rank, tag=tag) [default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward [default0]:[rank40]: Traceback (most recent call last): [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank41]: dist.send( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank40]: trainer.train(dataloader) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default3]:[rank43]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank43]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank40]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply [default2]:[rank42]: return user_fn(self, *args) [default0]:[rank40]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank40]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank41]: return func(*args, **kwargs) [default1]:[rank41]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send [default3]:[rank43]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 295, in train_batch_iter [default1]:[rank41]: group.send([tensor], group_dst_rank, tag).wait() [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward [default2]:[rank42]: pipeline_state.run_communication() [default1]:[rank41]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0. [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward [default3]:[rank43]: grad_accumulator.backward(sum(activations)) [default0]:[rank40]: self.backward(context=context, state=state, grad_accumulator=grad_accumulator) [default2]:[rank42]: send_activation() [default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__ [default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward [default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 86, in backward W0703 07:55:58.320000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 841045 closing signal SIGTERM W0703 07:55:58.321000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 841046 closing signal SIGTERM W0703 07:55:58.321000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 841047 closing signal SIGTERM W0703 07:55:58.321000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 841048 closing signal SIGTERM W0703 07:55:58.321000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 841049 closing signal SIGTERM W0703 07:55:58.321000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 841050 closing signal SIGTERM W0703 07:55:58.321000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 841051 closing signal SIGTERM W0703 07:55:58.346000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351310 closing signal SIGTERM W0703 07:55:58.346000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351311 closing signal SIGTERM W0703 07:55:58.346000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351312 closing signal 
SIGTERM
W0703 07:55:58.347000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351313 closing signal SIGTERM
W0703 07:55:58.350000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351314 closing signal SIGTERM
W0703 07:55:58.351000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351315 closing signal SIGTERM
W0703 07:55:58.351000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351316 closing signal SIGTERM
W0703 07:55:58.352000 140516387739456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1351317 closing signal SIGTERM
[default0]:[rank40]: grad_accumulator.backward(sum(activations))
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/optim/gradient_accumulator.py", line 205, in backward
[default0]:[rank40]: result = loss.backward()
[default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[default0]:[rank40]: torch.autograd.backward(
[default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[default0]:[rank40]: _engine_run_backward(
[default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[default0]:[rank40]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[default0]:[rank40]: return user_fn(self, *args)
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward
[default0]:[rank40]: pipeline_state.run_communication()
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication
[default0]:[rank40]: send_activation()
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__
[default0]:[rank40]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank)
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors
[default0]:[rank40]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag)
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors
[default0]:[rank40]: self._send_meta(tensor, to_rank=to_rank, tag=tag)
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta
[default0]:[rank40]: dist.send(
[default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default0]:[rank40]: return func(*args, **kwargs)
[default0]:[rank40]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send
[default0]:[rank40]: group.send([tensor], group_dst_rank, tag).wait()
[default0]:[rank40]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
[default2]:[rank42]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank)
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors
[default2]:[rank42]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag)
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors
[default2]:[rank42]: self._send_meta(tensor, to_rank=to_rank, tag=tag)
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta
[default2]:[rank42]: dist.send(
[default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default2]:[rank42]: return func(*args, **kwargs)
[default2]:[rank42]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send
[default2]:[rank42]: group.send([tensor], group_dst_rank, tag).wait()
[default2]:[rank42]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
[default3]:[rank43]: result = loss.backward()
[default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[default3]:[rank43]: torch.autograd.backward(
[default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[default3]:[rank43]: _engine_run_backward(
[default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[default3]:[rank43]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[default3]:[rank43]: return user_fn(self, *args)
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 40, in backward
[default3]:[rank43]: pipeline_state.run_communication()
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 142, in run_communication
[default3]:[rank43]: send_activation()
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 22, in __call__
[default3]:[rank43]: self.p2p.send_tensors([self.activation], to_rank=self.to_rank)
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 348, in send_tensors
[default3]:[rank43]: futures = self.isend_tensors(tensors=tensors, to_rank=to_rank, tag=tag)
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 295, in isend_tensors
[default3]:[rank43]: self._send_meta(tensor, to_rank=to_rank, tag=tag)
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 221, in _send_meta
[default3]:[rank43]: dist.send(
[default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default3]:[rank43]: return func(*args, **kwargs)
[default3]:[rank43]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1886, in send
[default3]:[rank43]: group.send([tensor], group_dst_rank, tag).wait()
[default3]:[rank43]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
W0703 07:55:58.382000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884546 closing signal SIGTERM
W0703 07:55:58.382000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884547 closing signal SIGTERM
W0703 07:55:58.382000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884548 closing signal SIGTERM
W0703 07:55:58.384000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884549 closing signal SIGTERM
W0703 07:55:58.384000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884550 closing signal SIGTERM
W0703 07:55:58.384000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884551 closing signal SIGTERM
W0703 07:55:58.384000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884552 closing signal SIGTERM
W0703 07:55:58.387000 139850146191168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 884553 closing signal SIGTERM
W0703 07:55:58.397000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944428 closing signal SIGTERM
W0703 07:55:58.398000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944429 closing signal SIGTERM
W0703 07:55:58.378000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369779 closing signal SIGTERM
W0703 07:55:58.379000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369780 closing signal SIGTERM
W0703 07:55:58.379000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369781 closing signal SIGTERM
W0703 07:55:58.380000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369782 closing signal SIGTERM
W0703 07:55:58.381000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369783 closing signal SIGTERM
W0703 07:55:58.381000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369784 closing signal SIGTERM
W0703 07:55:58.384000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369785 closing signal SIGTERM
W0703 07:55:58.384000 139721385899840 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 369786 closing signal SIGTERM
W0703 07:55:58.399000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944430 closing signal SIGTERM
W0703 07:55:58.399000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944431 closing signal SIGTERM
W0703 07:55:58.399000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944432 closing signal SIGTERM
W0703 07:55:58.399000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944433 closing signal SIGTERM
W0703 07:55:58.401000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944434 closing signal SIGTERM
W0703 07:55:58.402000 139708123027264 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 944435 closing signal SIGTERM
W0703 07:55:58.407000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453786 closing signal SIGTERM
W0703 07:55:58.407000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453787 closing signal SIGTERM
W0703 07:55:58.407000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453788 closing signal SIGTERM
W0703 07:55:58.408000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453789 closing signal SIGTERM
W0703 07:55:58.407000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075200 closing signal SIGTERM
W0703 07:55:58.408000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075201 closing signal SIGTERM
W0703 07:55:58.408000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075202 closing signal SIGTERM
W0703 07:55:58.409000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453790 closing signal SIGTERM
W0703 07:55:58.408000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075203 closing signal SIGTERM
W0703 07:55:58.409000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075204 closing signal SIGTERM
W0703 07:55:58.410000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453791 closing signal SIGTERM
W0703 07:55:58.410000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453792 closing signal SIGTERM
W0703 07:55:58.410000 140479000581952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 453793 closing signal SIGTERM
W0703 07:55:58.413000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075205 closing signal SIGTERM
W0703 07:55:58.413000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075206 closing signal SIGTERM
W0703 07:55:58.413000 140434965161792 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2075207 closing signal SIGTERM
E0703 07:56:00.487000 140629324416832 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 7 (pid: 841052) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
W0703 07:56:00.501000 140629324416832 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_840971_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
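
Reading the rank tracebacks above: every failing rank dies inside nanotron's pipeline-parallel p2p path (send_tensors -> isend_tensors -> _send_meta -> dist.send) once a peer has already aborted the NCCL communicator. The snippet below is not nanotron's implementation, only a minimal hypothetical sketch of the same "send shape metadata first, then the payload" pattern built from plain torch.distributed point-to-point calls; it can be launched with `torchrun --nproc_per_node=2` on the gloo backend. If the receiving process dies between the sends, the sender surfaces a backend error at dist.send, much like the DistBackendError in the log.

# Hypothetical sketch of a metadata-then-payload p2p send, loosely mirroring the
# send_tensors -> _send_meta -> dist.send call chain shown in the tracebacks above.
# Launch with: torchrun --nproc_per_node=2 p2p_sketch.py
import torch
import torch.distributed as dist

def send_with_meta(tensor: torch.Tensor, dst: int, tag: int = 0) -> None:
    # First ship the shape so the receiver can pre-allocate (stand-in for _send_meta).
    meta = torch.tensor(list(tensor.shape), dtype=torch.long)
    dist.send(torch.tensor([meta.numel()], dtype=torch.long), dst=dst, tag=tag)
    dist.send(meta, dst=dst, tag=tag)
    # Then ship the payload itself (stand-in for isend_tensors + wait()).
    dist.send(tensor, dst=dst, tag=tag)

def recv_with_meta(src: int, tag: int = 0) -> torch.Tensor:
    ndim = torch.zeros(1, dtype=torch.long)
    dist.recv(ndim, src=src, tag=tag)
    shape = torch.zeros(int(ndim.item()), dtype=torch.long)
    dist.recv(shape, src=src, tag=tag)
    payload = torch.zeros(*shape.tolist())
    dist.recv(payload, src=src, tag=tag)
    return payload

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")  # gloo so the sketch runs on CPU-only machines
    if dist.get_rank() == 0:
        send_with_meta(torch.randn(2, 3), dst=1)
    else:
        print(recv_with_meta(src=0).shape)
    dist.destroy_process_group()
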
W0703 07:56:00.531000 140629324416832 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_840971_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:00.539000 140629324416832 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_840971_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_07:55:58
  host      : ip-26-0-163-147.ec2.internal
  rank      : 15 (local_rank: 7)
  exitcode  : -6 (pid: 841052)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 841052
============================================================
srun: error: ip-26-0-163-147: task 1: Exited with exit code 1
W0703 07:56:02.356000 140473339848448 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_453711_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:02.635000 139715725166336 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_369706_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:02.711000 140429304428288 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-7.ec2.internal_2075126_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:02.939000 140510727005952 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-202.ec2.internal_1351235_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:02.948000 139702462293760 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_944354_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
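
In the failure summary above, the root cause is reported only as "Signal 6 (SIGABRT) received by PID 841052" and the error_file field is empty. For Python-level exceptions, torch.distributed.elastic can persist a structured traceback into that field when the child entrypoint is wrapped with its `record` decorator; a hard abort inside native NCCL code may still leave it empty. The snippet below is a hypothetical sketch of that wrapping, not a claim about how run_train.py is actually structured.

# Hypothetical sketch: annotating a training entrypoint so torchrun can populate
# the error_file field of its failure summary on Python exceptions.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main() -> None:
    # build the trainer and run the training loop here
    raise RuntimeError("demo failure so torchrun records an error file")

if __name__ == "__main__":
    main()
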
W0703 07:56:03.034000 139844485457664 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-174-36.ec2.internal_884473_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:05.647000 139850146191168 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_884473_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:05.663000 139850146191168 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_884473_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting
    self._state_holder.sync()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync
    get_response = self._backend.get_state()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-174-36: task 7: Exited with exit code 1
W0703 07:56:07.360000 140473339848448 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_453711_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:07.640000 139715725166336 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_369706_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:07.824000 140429304428288 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-7.ec2.internal_2075126_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:07.954000 139702462293760 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_944354_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:08.053000 140510727005952 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-202.ec2.internal_1351235_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:09.259000 140479000581952 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_453711_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:09.274000 140479000581952 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_453711_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
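
The "Broken pipe" / RendezvousConnectionError tracebacks come from the surviving torchrun agents polling rendezvous state from the C10d store after the node hosting that store went down (each surviving agent then prints the same traceback, so only the first occurrence is kept below). As a rough illustration of what that store is, here is a hypothetical, standalone key/value round trip against a torch.distributed.TCPStore; the c10d rendezvous backend talks to such a store internally, though the exact wiring differs.

# Hypothetical sketch: the kind of TCP key/value store the c10d rendezvous
# backend queries in get_state()/_call_store above.
from datetime import timedelta
import torch.distributed as dist

# Host side (normally lives inside the torchrun process on the rendezvous node).
server = dist.TCPStore("127.0.0.1", 29500, world_size=2, is_master=True,
                       timeout=timedelta(seconds=30), wait_for_workers=False)
# Client side (what the other agents keep polling). If the host process dies,
# these calls fail with connection errors like the "Broken pipe" in the log.
client = dist.TCPStore("127.0.0.1", 29500, world_size=2, is_master=False,
                       timeout=timedelta(seconds=30))
client.set("status", "running")
print(server.get("status"))
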
W0703 07:56:09.336000 139721385899840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_369706_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:09.349000 139721385899840 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_369706_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:09.737000 140434965161792 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-7.ec2.internal_2075126_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
srun: error: ip-26-0-164-207: task 2: Exited with exit code 1
W0703 07:56:09.753000 140434965161792 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-7.ec2.internal_2075126_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:09.780000 140516387739456 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-202.ec2.internal_1351235_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:09.794000 140516387739456 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-202.ec2.internal_1351235_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
srun: error: ip-26-0-173-246: task 6: Exited with exit code 1
W0703 07:56:10.191000 139708123027264 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_944354_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:56:10.210000 139708123027264 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_944354_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
srun: error: ip-26-0-173-7: task 4: Exited with exit code 1
srun: error: ip-26-0-173-202: task 5: Exited with exit code 1
srun: error: ip-26-0-165-24: task 3: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
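
The closing hint refers to the optional hf_transfer upload backend for huggingface_hub. A hypothetical sketch of enabling it for the uploads this job performs is shown below; it assumes `pip install hf_transfer` has been run, and in practice the flag is usually exported as HF_HUB_ENABLE_HF_TRANSFER=1 in the job environment before launch. The file, path, and repo names are placeholders, not values from this run.

import os
# Must be set before huggingface_hub is imported, otherwise the flag is ignored.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="log_metrics.csv",          # placeholder local file
    path_in_repo="results/log_metrics.csv",     # placeholder destination path
    repo_id="some-user/bench_cluster_results",  # placeholder repo
)
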