========================
START TIME: Wed Jul 3 07:11:01 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0703 07:11:06.685000 139769172420416 torch/distributed/run.py:757]
W0703 07:11:06.685000 139769172420416 torch/distributed/run.py:757] *****************************************
W0703 07:11:06.685000 139769172420416 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 07:11:06.685000 139769172420416 torch/distributed/run.py:757] *****************************************
[identical OMP_NUM_THREADS warnings from the seven other torchrun launchers elided]
[default0]:07/03/2024 07:11:33 [WARNING|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Vocab Size Padding] Padded vocab (size: 50257) with 3 dummy tokens (new size: 50260)
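The padding above is consistent with rounding the tokenizer's 50257-entry vocab up to the next multiple of the tensor-parallel degree (tp=4, with make_vocab_size_divisible_by=1 in the config below), so the embedding matrix shards evenly. A minimal sketch of that rule; the helper is hypothetical, not nanotron's actual code:

# Hypothetical padding rule implied by the [Vocab Size Padding] line above.
def padded_vocab_size(vocab_size: int, tp: int, divisible_by: int = 1) -> int:
    multiple = tp * divisible_by
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert padded_vocab_size(50257, tp=4) == 50260  # 3 dummy tokens, as logged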
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Config:
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Config(general=GeneralArgs(project='bench_cluster',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: run='%date_%jobid',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: seed=42,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: step=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: consumed_train_samples=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: benchmark_csv_path=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: ignore_sanity_checks=True),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: parallelism=ParallelismArgs(dp=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pp=16,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp=4,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pp_engine=,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp_mode=,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp_linear_async_communication=False,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: expert_parallel_size=1),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: eos_token_id=2,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_act='silu',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_size=2048,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: initializer_range=0.02,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: intermediate_size=4096,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: is_llama_config=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: max_position_embeddings=4096,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_attention_heads=32,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_hidden_layers=24,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_key_value_heads=32,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pad_token_id=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pretraining_tp=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rms_norm_eps=1e-05,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_scaling=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_theta=10000.0,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tie_word_embeddings=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: use_cache=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: vocab_size=50260),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: init_method=RandomInit(std=0.025),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dtype=torch.bfloat16,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: make_vocab_size_divisible_by=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: ddp_bucket_cap_mb=25),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer_revision=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer_max_length=None),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoint_interval=100000,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: save_initial_state=False,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: resume_checkpoint_path=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoints_path_is_shared_file_system=False),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: logging=LoggingArgs(log_level='info',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: log_level_replica='info',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: iteration_step_info_interval=1),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokens=TokensArgs(sequence_length=4096,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: train_steps=20,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: micro_batch_size=64,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: batch_accumulation_per_replica=16,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: val_check_interval=-1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: limit_val_batches=0,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: limit_test_batches=0),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: adam_beta1=0.9,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: adam_beta2=0.95,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: torch_adam_is_fused=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: name='adamW'),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: zero_stage=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: weight_decay=0.01,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: clip_grad=1.0,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: accumulate_grad_in_fp32=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_warmup_steps=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_warmup_style='linear',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_style='linear',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_steps=19,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_starting_step=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: min_decay_lr=1e-05)),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: data_stages=[DatasetStageArgs(name='Training Stage',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: start_training_step=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hf_dataset_splits='train',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hf_dataset_config_name=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dataset_processing_num_proc_per_process=64,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dataset_overwrite_cache=False,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: text_column_name='text'),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: seed=42,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_loading_workers=0))],
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-4_pp-16_mbz-64')),
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lighteval=None)
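A quick sanity check of the launch arithmetic implied by the config just dumped; every number below is restated from the log, nothing is measured:

# dp * tp * pp must equal the GPU count (the profiler path says 64_GPUS).
dp, tp, pp = 1, 4, 16
assert dp * tp * pp == 64

# One optimizer step consumes mbs * grad_accum * dp sequences.
micro_batch_size, batch_accumulation_per_replica = 64, 16
global_batch_size = micro_batch_size * batch_accumulation_per_replica * dp
assert global_batch_size == 1024  # echoed later in the [Start training] line

sequence_length, train_steps = 4096, 20
tokens_per_step = global_batch_size * sequence_length  # 4,194,304 tokens per step
total_tokens = tokens_per_step * train_steps           # ~83.9M tokens for the whole run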
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Model Config:
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: eos_token_id=2,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_act='silu',
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_size=2048,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: initializer_range=0.02,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: intermediate_size=4096,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: is_llama_config=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: max_position_embeddings=4096,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_attention_heads=32,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_hidden_layers=24,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_key_value_heads=32,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pad_token_id=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pretraining_tp=1,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rms_norm_eps=1e-05,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_scaling=None,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_theta=10000.0,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tie_word_embeddings=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: use_cache=True,
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: vocab_size=50260)
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Building model..
[default0]:07/03/2024 07:11:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Setting PP block ranks...
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-233]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Total number of parameters: 1.21G (2313.42MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB)
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-233]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Parametrizing model parameters using StandardParametrizator
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-147]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=3|ip-26-0-163-147]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-147]: Local number of parameters: 10.5M (20.01MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-147]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=2|TP=2|ip-26-0-163-147]: No checkpoint path provided.
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=3|ip-26-0-163-147]: No checkpoint path provided.
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=1|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=1|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=1|ip-26-0-165-24]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB)
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-233]: No checkpoint path provided.
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: Local number of parameters: 46.7M (89.10MiB)
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: No checkpoint path provided.
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=1|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=1|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=1|ip-26-0-173-202]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=2|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=1|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=1|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=1|ip-26-0-173-202]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=2|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=2|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=2|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=2|ip-26-0-173-202]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=2|ip-26-0-173-202]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=0|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=0|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=0|ip-26-0-173-202]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=3|ip-26-0-173-202]: Local number of parameters: 10.5M (20.01MiB)
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=3|ip-26-0-173-202]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=8|TP=3|ip-26-0-173-202]: No checkpoint path provided.
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=2|ip-26-0-163-147]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-147]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-147]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-7]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=0|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-7]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=0|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-7]: No checkpoint path provided.
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-7]: No checkpoint path provided.
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=1|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=1|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=3|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=3|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=2|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB)
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=2|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=3|ip-26-0-173-246]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=2|ip-26-0-173-246]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=0|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB)
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=0|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=1|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=1|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=2|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-7]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=2|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=2|ip-26-0-165-24]: No checkpoint path provided.
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-7]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=2|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=2|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-7]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-7]: Local number of parameters: 21M (40.02MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-7]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=3|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB)
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default1]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-207]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB)
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=3|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=0|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB)
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=3|ip-26-0-165-24]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=0|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=0|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=0|ip-26-0-165-24]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=2|ip-26-0-165-24]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-207]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=2|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=2|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=2|ip-26-0-164-207]: No checkpoint path provided.
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=0|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=0|ip-26-0-173-202]: No checkpoint path provided.
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=3|ip-26-0-173-202]: Local number of parameters: 21M (40.02MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=3|ip-26-0-173-202]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=9|TP=3|ip-26-0-173-202]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-207]: No checkpoint path provided.
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=3|ip-26-0-173-246]: Local number of parameters: 10.5M (20.01MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=3|ip-26-0-173-246]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=11|TP=3|ip-26-0-173-246]: No checkpoint path provided.
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=0|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=1|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=0|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default0]:07/03/2024 07:11:49 [INFO|DP=0|PP=6|TP=0|ip-26-0-165-24]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=1|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=1|ip-26-0-165-24]: No checkpoint path provided.
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=3|ip-26-0-164-207]: Local number of parameters: 21M (40.02MiB)
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=3|ip-26-0-164-207]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default3]:07/03/2024 07:11:49 [INFO|DP=0|PP=4|TP=3|ip-26-0-164-207]: No checkpoint path provided.
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=2|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=2|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default2]:07/03/2024 07:11:49 [INFO|DP=0|PP=10|TP=2|ip-26-0-173-246]: No checkpoint path provided.
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-233]: Local number of parameters: 21M (40.02MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-233]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-233]: No checkpoint path provided.
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=3|ip-26-0-165-24]: Local number of parameters: 21M (40.02MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=3|ip-26-0-165-24]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=7|TP=3|ip-26-0-165-24]: No checkpoint path provided.
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=3|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB)
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=3|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default7]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=3|ip-26-0-164-207]: No checkpoint path provided.
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=2|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB)
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=2|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default6]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=2|ip-26-0-164-207]: No checkpoint path provided.
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-207]: Local number of parameters: 10.5M (20.01MiB)
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-207]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default5]:07/03/2024 07:11:49 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-207]: No checkpoint path provided.
[default0]:07/03/2024 07:11:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/03/2024 07:11:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/03/2024 07:11:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] DP Rank 0 has 46.7M out of 46.7M (100.00%) params' optimizer states
[default0]:07/03/2024 07:11:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/03/2024 07:11:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Using `datasets` library
[default0]:07/03/2024 07:11:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:07/03/2024 07:11:52 [WARNING|DP=0|PP=0|TP=0|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:Repo card metadata block was not found. Setting CardData to empty.
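The sizes above check out against bf16 storage (2 bytes per parameter), and with dp=1 the ZeRO-1 shard necessarily holds 100% of the local optimizer states. A rough accounting sketch; the 12-bytes-per-parameter optimizer-state figure is an assumption for AdamW with accumulate_grad_in_fp32=True, not something the log reports:

MiB = 2**20

# bf16 parameters, 2 bytes each: a PP=0 rank's 46.7M params -> ~89.1 MiB ("89.10MiB" above).
print(46.7e6 * 2 / MiB)         # ~89.07
# Total "1.21G (2313.42MiB)": 2313.42 MiB / 2 bytes per param -> ~1.21e9 params.
print(2313.42 * MiB / 2 / 1e9)  # ~1.21

# zero_stage=1 shards optimizer states across DP ranks; with dp=1 there is a single
# shard, hence "DP Rank 0 has 46.7M out of 46.7M (100.00%)".
# Assumed per-param state: fp32 master copy + fp32 exp_avg + fp32 exp_avg_sq = 12 bytes.
print(46.7e6 * 12 / MiB)        # ~534 MiB of optimizer state on a PP=0 rank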
[default0]:07/03/2024 07:11:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] There are 1 training stages
[default0]:07/03/2024 07:11:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Stage Training Stage] start from step 1
[default0]:07/03/2024 07:11:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]:
[default0]:07/03/2024 07:11:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Start training] datetime: 2024-07-03 07:11:54.281735 | mbs: 64 | grad_accum: 16 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
[default0]:07/03/2024 07:11:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/03/2024 07:11:54 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Memory usage: 448.42MiB. Peak allocated 448.42MiB. Peak reserved: 456.00MiB
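The run covers 20 steps: 1 warmup step to the peak LR, then 19 decay steps, matching the LRSchedulerArgs dumped earlier. A small sketch with assumed linear semantics rather than nanotron's exact implementation:

# lr_warmup_steps=1 up to learning_rate=1e-4, then linear decay over
# lr_decay_steps=19 down to min_decay_lr=1e-5, covering all 20 train_steps.
def lr_at(step: int, peak: float = 1e-4, warmup: int = 1,
          decay_steps: int = 19, min_lr: float = 1e-5) -> float:
    if step <= warmup:
        return peak * step / warmup
    done = min(step - warmup, decay_steps)
    return peak - (peak - min_lr) * (done / decay_steps)

assert lr_at(1) == 1e-4               # peak at the end of warmup
assert abs(lr_at(20) - 1e-5) < 1e-12  # floor on the final step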
[default5]:07/03/2024 07:11:54 [WARNING|DP=0|PP=1|TP=1|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:Repo card metadata block was not found. Setting CardData to empty.
[the same "Repo card metadata block was not found. Setting CardData to empty." warning from every other rank elided]
[default0]:07/03/2024 07:12:04 [WARNING|DP=0|PP=4|TP=0|ip-26-0-164-207]: Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub
[default0]:07/03/2024 07:12:04 [WARNING|DP=0|PP=4|TP=0|ip-26-0-164-207]: Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024).
[default0]:Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub
[default0]:Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024).
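The two warnings above are the `datasets` offline fallback: the Hub lookup for roneneldan/TinyStories failed, so the loader reused the local cache. The same behavior can be forced explicitly; the env-var switch below is standard `datasets` usage, though the run above hit the fallback implicitly:

import os

# Resolve the dataset from the local cache only; set before importing `datasets`.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories", split="train")  # served from ~/.cache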
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default2]:[rank2]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default2]:[rank2]: merged_states = self.gate_up_proj(hidden_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default2]:[rank2]: return column_linear( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default2]:[rank2]: return F.linear(input, weight, bias) [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of 
memory. Tried to allocate 1024.00 MiB. GPU  has a total capacity of 79.33 GiB of which 573.94 MiB is free. Including non-PyTorch memory, this process has 78.76 GiB memory in use. Of the allocated memory 67.83 GiB is allocated by PyTorch, and 915.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank3]: trainer.train(dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:[rank3]: output = model(**micro_batch) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", 
[default3]:[rank3]: Traceback (most recent call last):
[default3]:[rank3]: (frames identical to rank 2 above, ending in column_linear -> F.linear)
[default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU has a total capacity of 79.33 GiB of which 741.94 MiB is free. Including non-PyTorch memory, this process has 78.59 GiB memory in use. Of the allocated memory 67.83 GiB is allocated by PyTorch, and 915.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default0]:[rank0]: Traceback (most recent call last):
[default0]:[rank0]: (frames identical to rank 2 above)
[default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU
[default1]:[rank1]: Traceback (most recent call last):
[default1]:[rank1]: (frames identical to rank 2 above)
[default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU has a total capacity of 79.33 GiB of which 501.94 MiB is free. Including non-PyTorch memory, this process has 78.83 GiB memory in use. Of the allocated memory 67.83 GiB is allocated by PyTorch, and 915.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
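All four tensor-parallel ranks of the first pipeline stage (ranks 0-3) failed on the same 1024 MiB allocation in `column_linear`, each with 67.83 GiB already held by PyTorch on a ~79.33 GiB device. A minimal mitigation sketch follows; the `expandable_segments` knob is the one the error text itself suggests, while pointing at the micro-batch size as the more direct fix for this pp=16/tp=4 run is an assumption, not a measured conclusion:

```python
# Hypothetical sketch, not part of the original job. PYTORCH_CUDA_ALLOC_CONF
# is read when the CUDA caching allocator is first used, so it must be set
# before the first CUDA allocation -- e.g. at the very top of run_train.py,
# or in the launcher environment.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var so the allocator picks it up

# expandable_segments only helps with fragmentation (here ~915 MiB reserved
# but unallocated). With 67.83 GiB of live tensors, the more direct fix is to
# shrink the activation footprint: a smaller micro batch size or shorter
# sequence length in the nanotron config, or activation recomputation.
```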
[default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:135.)
[default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
[default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
(the same c10d::allreduce_ UserWarning was emitted verbatim by every other local rank; duplicates omitted)
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:[rank5]: pipeline_state.run_communication() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank5]: recv_activation_tensor = recv_activation() [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank5]: dist.recv( [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank5]: return func(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank5]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank7]: sharded_logits = self.model( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:[rank7]: 
pipeline_state.run_communication() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:[rank7]: recv_activation_tensor = recv_activation() [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]:[rank7]: dist.recv( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default7]:[rank7]: return func(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:[rank7]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank6]: Traceback (most recent call last): [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank6]: pipeline_state.run_communication() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank6]: recv_activation_tensor = recv_activation() [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank6]: dist.recv( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank6]: return func(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank6]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:[rank4]: 
pipeline_state.run_communication() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:[rank4]: recv_activation_tensor = recv_activation() [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]:[rank4]: dist.recv( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default4]:[rank4]: return func(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:[rank4]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. [default3]:[rank11]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. 
[default1]:[rank9]: Traceback (most recent call last):
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default1]:[rank9]:     trainer.train(dataloader)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default1]:[rank9]:     outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default1]:[rank9]:     outputs = self.pipeline_engine.train_batch_iter(
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default1]:[rank9]:     output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default1]:[rank9]:     output = model(**micro_batch)
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank9]:     return self._call_impl(*args, **kwargs)
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank9]:     return forward_call(*args, **kwargs)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default1]:[rank9]:     sharded_logits = self.model(
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank9]:     return self._call_impl(*args, **kwargs)
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank9]:     return forward_call(*args, **kwargs)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default1]:[rank9]:     return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default1]:[rank9]:     hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank9]:     return self._call_impl(*args, **kwargs)
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank9]:     return forward_call(*args, **kwargs)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default1]:[rank9]:     new_kwargs[name] = recv_from_pipeline_state_buffer(
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default1]:[rank9]:     pipeline_state.run_communication()
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default1]:[rank9]:     recv_activation_tensor = recv_activation()
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default1]:[rank9]:     return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default1]:[rank9]:     buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default1]:[rank9]:     meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default1]:[rank9]:   File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default1]:[rank9]:     dist.recv(
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default1]:[rank9]:     return func(*args, **kwargs)
[default1]:[rank9]:   File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default1]:[rank9]:     pg.recv([tensor], group_src_rank, tag).wait()
[default1]:[rank9]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[...]: ranks 11 ([default3]), 10 ([default2]) and 8 ([default0]) abort with a traceback identical to rank 9's above, each likewise ending in "torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1."
[default7]:[rank15]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
[default5]:[rank13]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[default6]:[rank14]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
[default4]:[rank12]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[...]: ranks 15 ([default7]), 13 ([default5]), 14 ([default6]) and 12 ([default4]), the [Rank 3] pipeline group, abort with the same traceback as rank 9 above, each ending in "torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1."
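Every stack bottoms out in the same frame: P2P._recv_meta blocking in dist.recv on a small metadata tensor (the NumelIn=7 in the watchdog lines) that describes the activation about to be shipped between pipeline stages. The sketch below shows the general meta-then-payload receive pattern those frames walk through; it is not nanotron's actual implementation, and the metadata layout, dtype and names are assumptions.

# Hedged sketch of a two-step pipeline-parallel receive. Assumes an already
# initialized NCCL process group and a sender using the mirrored protocol.
import torch
import torch.distributed as dist

META_NUMEL = 7  # assumed layout: [tag, dtype id, requires_grad, ndim, *shape]

def recv_activation(from_rank: int, tag: int = 0) -> torch.Tensor:
    # Step 1: blocking receive of the fixed-size metadata header. This is the
    # dist.recv the watchdog timed out on (OpType=RECV, NumelIn=7).
    meta = torch.empty(META_NUMEL, dtype=torch.long, device="cuda")
    dist.recv(meta, src=from_rank, tag=tag)
    ndim = int(meta[3])
    shape = [int(s) for s in meta[4 : 4 + ndim]]
    # Step 2: size the payload buffer from the metadata, then receive into it.
    buf = torch.empty(shape, dtype=torch.bfloat16, device="cuda")  # dtype assumed
    dist.irecv(buf, src=from_rank, tag=tag).wait()
    return buf

Because the metadata recv is a point-to-point dependency on the previous stage, one stage that stops sending stalls every stage behind it, which is why ranks 4 and 8 through 15 shown here all hang on the same SeqNum=63 and die together ten minutes later.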
[default3]:[rank11]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62.
[default3]:[rank11]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default3]:[rank11]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[default3]:[rank11]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f70003c9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f70016a2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f70016a7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f70016a8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f704d141e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f7052188609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f7051f53353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. [default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f70003c9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f70016a2c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f70016a7a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f70016a8dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: + 0xd3e95 (0x7f704d141e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #5: + 0x8609 (0x7f7052188609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #6: clone + 0x43 (0x7f7051f53353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f70003c9897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0xe32119 (0x7f700132c119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: + 0xd3e95 (0x7f704d141e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #3: + 0x8609 
(0x7f7052188609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #4: clone + 0x43 (0x7f7051f53353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default1]:[rank9]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default1]:[rank9]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. [default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a15cdd897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2a16fb6c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2a16fbba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2a16fbcdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f2a62a55e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f2a67a9c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f2a67867353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:terminate called after throwing an instance of 'c10::DistBackendError' [default1]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. 
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a15cdd897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2a16fb6c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2a16fbba80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2a16fbcdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #4: + 0xd3e95 (0x7f2a62a55e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #5: + 0x8609 (0x7f2a67a9c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #6: clone + 0x43 (0x7f2a67867353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a15cdd897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0xe32119 (0x7f2a16c40119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #2: + 0xd3e95 (0x7f2a62a55e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default1]:frame #3: + 0x8609 (0x7f2a67a9c609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default1]:frame #4: clone + 0x43 (0x7f2a67867353 in /lib/x86_64-linux-gnu/libc.so.6) [default1]: [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default2]:[rank10]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. 
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa3956b3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa39698cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa396991a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa396992dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fa3e242be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fa3e7472609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fa3e723d353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:terminate called after throwing an instance of 'c10::DistBackendError' [default2]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600068 milliseconds before timing out. [default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa3956b3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7fa39698cc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa396991a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa396992dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #4: + 0xd3e95 (0x7fa3e242be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #5: + 0x8609 (0x7fa3e7472609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #6: clone + 0x43 (0x7fa3e723d353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa3956b3897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0xe32119 (0x7fa396616119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: + 0xd3e95 (0x7fa3e242be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default2]:frame #3: + 0x8609 
(0x7fa3e7472609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default2]:frame #4: clone + 0x43 (0x7fa3e723d353 in /lib/x86_64-linux-gnu/libc.so.6) [default2]: [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 2] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank8]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. [default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f89d3f8c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f89d5265c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f89d526aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f89d526bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f8a20d04e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f8a25d4b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f8a25b16353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [PG 4 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. 
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f89d3f8c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f89d5265c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f89d526aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f89d526bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: + 0xd3e95 (0x7f8a20d04e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #5: + 0x8609 (0x7f8a25d4b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #6: clone + 0x43 (0x7f8a25b16353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f89d3f8c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xe32119 (0x7f89d4eef119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f8a20d04e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f8a25d4b609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f8a25b16353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [default7]:[rank15]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. 
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fab4d624897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fab4e8fdc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fab4e902a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fab4e903dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #4: <unknown function> + 0xd3e95 (0x7fab9a39ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default7]:frame #5: <unknown function> + 0x8609 (0x7fab9f3e3609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #6: clone + 0x43 (0x7fab9f1ae353 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default7]:terminate called after throwing an instance of 'c10::DistBackendError'
[default7]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fab4d624897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fab4e8fdc62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fab4e902a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fab4e903dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #4: <unknown function> + 0xd3e95 (0x7fab9a39ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default7]:frame #5: <unknown function> + 0x8609 (0x7fab9f3e3609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #6: clone + 0x43 (0x7fab9f1ae353 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fab4d624897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: <unknown function> + 0xe32119 (0x7fab4e587119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: <unknown function> + 0xd3e95 (0x7fab9a39ce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default7]:frame #3: <unknown function> + 0x8609 (0x7fab9f3e3609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #4: clone + 0x43 (0x7fab9f1ae353 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default5]:[rank13]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62.
[default5]:[rank13]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default5]:[rank13]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[default5]:[rank13]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6d58498897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6d59771c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6d59776a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6d59777dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #4: <unknown function> + 0xd3e95 (0x7f6da5210e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default5]:frame #5: <unknown function> + 0x8609 (0x7f6daa257609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #6: clone + 0x43 (0x7f6daa022353 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default5]:terminate called after throwing an instance of 'c10::DistBackendError'
[default5]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6d58498897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6d59771c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6d59776a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6d59777dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #4: <unknown function> + 0xd3e95 (0x7f6da5210e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default5]:frame #5: <unknown function> + 0x8609 (0x7f6daa257609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #6: clone + 0x43 (0x7f6daa022353 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6d58498897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: <unknown function> + 0xe32119 (0x7f6d593fb119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: <unknown function> + 0xd3e95 (0x7f6da5210e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default5]:frame #3: <unknown function> + 0x8609 (0x7f6daa257609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #4: clone + 0x43 (0x7f6daa022353 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default6]:[rank14]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62.
[default6]:[rank14]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default6]:[rank14]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[default6]:[rank14]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9fc0307897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9fc15e0c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9fc15e5a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9fc15e6dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #4: <unknown function> + 0xd3e95 (0x7fa00d07fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default6]:frame #5: <unknown function> + 0x8609 (0x7fa0120c6609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #6: clone + 0x43 (0x7fa011e91353 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default6]:terminate called after throwing an instance of 'c10::DistBackendError'
[default6]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9fc0307897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9fc15e0c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9fc15e5a80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9fc15e6dcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #4: <unknown function> + 0xd3e95 (0x7fa00d07fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default6]:frame #5: <unknown function> + 0x8609 (0x7fa0120c6609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #6: clone + 0x43 (0x7fa011e91353 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9fc0307897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: <unknown function> + 0xe32119 (0x7f9fc126a119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: <unknown function> + 0xd3e95 (0x7fa00d07fe95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default6]:frame #3: <unknown function> + 0x8609 (0x7fa0120c6609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #4: clone + 0x43 (0x7fa011e91353 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default4]:[rank12]:[E ProcessGroupNCCL.cpp:1537] [PG 4 Rank 3] Timeout at NCCL work: 63, last enqueued NCCL work: 63, last completed NCCL work: 62.
[default4]:[rank12]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default4]:[rank12]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[default4]:[rank12]:[E ProcessGroupNCCL.cpp:1414] [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f284129c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f2842575c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f284257aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f284257bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #4: <unknown function> + 0xd3e95 (0x7f288e014e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default4]:frame #5: <unknown function> + 0x8609 (0x7f289305b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #6: clone + 0x43 (0x7f2892e26353 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default4]:terminate called after throwing an instance of 'c10::DistBackendError'
[default4]: what(): [PG 4 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f284129c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<float, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f2842575c62 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f284257aa80 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f284257bdcc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #4: <unknown function> + 0xd3e95 (0x7f288e014e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default4]:frame #5: <unknown function> + 0x8609 (0x7f289305b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #6: clone + 0x43 (0x7f2892e26353 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f284129c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: <unknown function> + 0xe32119 (0x7f28421ff119 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: <unknown function> + 0xd3e95 (0x7f288e014e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default4]:frame #3: <unknown function> + 0x8609 (0x7f289305b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #4: clone + 0x43 (0x7f2892e26353 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
E0703 07:22:30.293000 139769172420416 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1704039) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1704040)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1704041)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1704042)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1704043)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1704044)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1704045)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1704046)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_07:22:30
  host      : ip-26-0-162-233.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1704039)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[default1]:[rank17]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[default0]:[rank16]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600045 milliseconds before timing out.
[default2]:[rank18]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[default3]:[rank19]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
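Note: every failure slot in the report above has an empty error_file (<N/A>), which is why the agent can only point at the elastic error docs instead of a worker traceback. A minimal sketch of what that hint refers to; `main` here is a stand-in name for the entrypoint in run_train.py, not its actual function name.

# Sketch: torch.distributed.elastic error recording, so that a rerun fills
# the error_file / traceback fields that read <N/A> in the report above.
# `main` is a hypothetical stand-in for the run_train.py entrypoint.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # build the trainer and call trainer.train(dataloader)

if __name__ == "__main__":
    main()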
srun: error: ip-26-0-162-233: task 0: Exited with exit code 1
[default1]:[rank17]: Traceback (most recent call last):
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default1]:[rank17]: trainer.train(dataloader)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default1]:[rank17]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default1]:[rank17]: outputs = self.pipeline_engine.train_batch_iter(
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default1]:[rank17]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default1]:[rank17]: output = model(**micro_batch)
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank17]: return self._call_impl(*args, **kwargs)
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank17]: return forward_call(*args, **kwargs)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default1]:[rank17]: sharded_logits = self.model(
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank17]: return self._call_impl(*args, **kwargs)
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank17]: return forward_call(*args, **kwargs)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default1]:[rank17]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default1]:[rank17]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank17]: return self._call_impl(*args, **kwargs)
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank17]: return forward_call(*args, **kwargs)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default1]:[rank17]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default1]:[rank17]: pipeline_state.run_communication()
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default1]:[rank17]: recv_activation_tensor = recv_activation()
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default1]:[rank17]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default1]:[rank17]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default1]:[rank17]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default1]:[rank17]: dist.recv(
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default1]:[rank17]: return func(*args, **kwargs)
[default1]:[rank17]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default1]:[rank17]: pg.recv([tensor], group_src_rank, tag).wait()
[default1]:[rank17]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
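Note: the traceback shows exactly where rank 17 blocked: nanotron's _recv_meta does a blocking dist.recv of a small shape/metadata tensor before the activation payload itself, which is consistent with the 7-element RECV (NumelIn=7) that the watchdog reports timing out. An illustrative sketch of that two-step pattern follows; it is not nanotron's actual implementation, and META_NUMEL's field layout and the payload dtype are assumptions.

# Illustrative sketch of the meta-then-payload receive the traceback walks
# through. META_NUMEL matches the NumelIn=7 in the watchdog report, but the
# metadata layout and the bfloat16 payload dtype are assumptions.
import torch
import torch.distributed as dist

META_NUMEL = 7  # assumed layout: [ndim, shape..., dtype id, ...]

def recv_activation(from_rank: int, tag: int = 0) -> torch.Tensor:
    # Step 1: blocking receive of the fixed-size metadata tensor.
    # This is the RECV that sat for ~600 s before the watchdog fired.
    meta = torch.empty(META_NUMEL, dtype=torch.int64, device="cuda")
    dist.recv(meta, src=from_rank, tag=tag)
    # Step 2: allocate the payload buffer from the metadata and receive it.
    ndim = int(meta[0].item())
    shape = meta[1 : 1 + ndim].tolist()
    buffer = torch.empty(shape, dtype=torch.bfloat16, device="cuda")
    dist.recv(buffer, src=from_rank, tag=tag)
    return buffer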
[default2]:[rank18]: Traceback (most recent call last):
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default2]:[rank18]: trainer.train(dataloader)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default2]:[rank18]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default2]:[rank18]: outputs = self.pipeline_engine.train_batch_iter(
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default2]:[rank18]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default2]:[rank18]: output = model(**micro_batch)
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank18]: return self._call_impl(*args, **kwargs)
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank18]: return forward_call(*args, **kwargs)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default2]:[rank18]: sharded_logits = self.model(
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank18]: return self._call_impl(*args, **kwargs)
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank18]: return forward_call(*args, **kwargs)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default2]:[rank18]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default2]:[rank18]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank18]: return self._call_impl(*args, **kwargs)
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank18]: return forward_call(*args, **kwargs)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default2]:[rank18]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default2]:[rank18]: pipeline_state.run_communication()
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default2]:[rank18]: recv_activation_tensor = recv_activation()
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default2]:[rank18]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default2]:[rank18]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default2]:[rank18]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default2]:[rank18]: dist.recv(
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default2]:[rank18]: return func(*args, **kwargs)
[default2]:[rank18]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default2]:[rank18]: pg.recv([tensor], group_src_rank, tag).wait()
[default2]:[rank18]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default0]:[rank16]: Traceback (most recent call last):
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default0]:[rank16]: trainer.train(dataloader)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default0]:[rank16]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default0]:[rank16]: outputs = self.pipeline_engine.train_batch_iter(
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default0]:[rank16]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default0]:[rank16]: output = model(**micro_batch)
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank16]: return self._call_impl(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank16]: return forward_call(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default0]:[rank16]: sharded_logits = self.model(
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank16]: return self._call_impl(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank16]: return forward_call(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default0]:[rank16]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default0]:[rank16]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank16]: return self._call_impl(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank16]: return forward_call(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default0]:[rank16]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default0]:[rank16]: pipeline_state.run_communication()
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default0]:[rank16]: recv_activation_tensor = recv_activation()
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default0]:[rank16]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default0]:[rank16]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default0]:[rank16]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default0]:[rank16]: dist.recv(
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default0]:[rank16]: return func(*args, **kwargs)
[default0]:[rank16]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default0]:[rank16]: pg.recv([tensor], group_src_rank, tag).wait()
[default0]:[rank16]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default3]:[rank19]: Traceback (most recent call last):
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default3]:[rank19]: trainer.train(dataloader)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default3]:[rank19]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default3]:[rank19]: outputs = self.pipeline_engine.train_batch_iter(
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default3]:[rank19]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default3]:[rank19]: output = model(**micro_batch)
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank19]: return self._call_impl(*args, **kwargs)
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank19]: return forward_call(*args, **kwargs)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default3]:[rank19]: sharded_logits = self.model(
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank19]: return self._call_impl(*args, **kwargs)
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank19]: return forward_call(*args, **kwargs)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default3]:[rank19]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default3]:[rank19]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank19]: return self._call_impl(*args, **kwargs)
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank19]: return forward_call(*args, **kwargs)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default3]:[rank19]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default3]:[rank19]: pipeline_state.run_communication()
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default3]:[rank19]: recv_activation_tensor = recv_activation()
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default3]:[rank19]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default3]:[rank19]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default3]:[rank19]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default3]:[rank19]: dist.recv(
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default3]:[rank19]: return func(*args, **kwargs)
[default3]:[rank19]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default3]:[rank19]: pg.recv([tensor], group_src_rank, tag).wait()
[default3]:[rank19]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default7]:[rank23]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[default4]:[rank20]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
[default5]:[rank21]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[default6]:[rank22]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=RECV, NumelIn=7, NumelOut=7, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
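Note: every timed-out operation above is the same RECV hitting the 600000 ms (10 min) watchdog limit. That limit is configurable at process-group creation; a minimal sketch follows, assuming the group were initialized directly through torch.distributed (nanotron wraps this, so the real knob may sit in its config). A longer timeout only helps if the matching send is late rather than never issued.

# Sketch: raising the collective timeout that produced Timeout(ms)=600000
# above. Assumes direct init_process_group usage, which is an assumption
# about the launch path, not nanotron's actual init code.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # the 600000 ms in the log is the default
)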
[default7]:[rank23]: Traceback (most recent call last):
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default7]:[rank23]: trainer.train(dataloader)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default7]:[rank23]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default7]:[rank23]: outputs = self.pipeline_engine.train_batch_iter(
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default7]:[rank23]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default7]:[rank23]: output = model(**micro_batch)
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default7]:[rank23]: return self._call_impl(*args, **kwargs)
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank23]: return forward_call(*args, **kwargs)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default7]:[rank23]: sharded_logits = self.model(
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default7]:[rank23]: return self._call_impl(*args, **kwargs)
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank23]: return forward_call(*args, **kwargs)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default7]:[rank23]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default7]:[rank23]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default7]:[rank23]: return self._call_impl(*args, **kwargs)
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank23]: return forward_call(*args, **kwargs)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default7]:[rank23]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default7]:[rank23]: pipeline_state.run_communication()
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default7]:[rank23]: recv_activation_tensor = recv_activation()
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default7]:[rank23]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default7]:[rank23]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default7]:[rank23]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default7]:[rank23]: dist.recv(
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default7]:[rank23]: return func(*args, **kwargs)
[default7]:[rank23]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default7]:[rank23]: pg.recv([tensor], group_src_rank, tag).wait()
[default7]:[rank23]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default4]:[rank20]: Traceback (most recent call last):
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default4]:[rank20]: trainer.train(dataloader)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default4]:[rank20]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default4]:[rank20]: outputs = self.pipeline_engine.train_batch_iter(
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default4]:[rank20]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default4]:[rank20]: output = model(**micro_batch)
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank20]: return self._call_impl(*args, **kwargs)
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank20]: return forward_call(*args, **kwargs)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default4]:[rank20]: sharded_logits = self.model(
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank20]: return self._call_impl(*args, **kwargs)
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank20]: return forward_call(*args, **kwargs)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default4]:[rank20]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default4]:[rank20]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank20]: return self._call_impl(*args, **kwargs)
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank20]: return forward_call(*args, **kwargs)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default4]:[rank20]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default4]:[rank20]: pipeline_state.run_communication()
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default4]:[rank20]: recv_activation_tensor = recv_activation()
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default4]:[rank20]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default4]:[rank20]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default4]:[rank20]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta
[default4]:[rank20]: dist.recv(
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default4]:[rank20]: return func(*args, **kwargs)
[default4]:[rank20]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default4]:[rank20]: pg.recv([tensor], group_src_rank, tag).wait()
[default4]:[rank20]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[default5]:[rank21]: Traceback (most recent call last):
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default5]:[rank21]: trainer.train(dataloader)
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default5]:[rank21]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default5]:[rank21]: outputs = self.pipeline_engine.train_batch_iter(
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default5]:[rank21]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]:[rank21]: output = model(**micro_batch)
[default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank21]: return self._call_impl(*args, **kwargs)
[default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank21]: return forward_call(*args, **kwargs)
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default5]:[rank21]: sharded_logits = self.model(
[default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank21]: return self._call_impl(*args, **kwargs)
[default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank21]: return forward_call(*args, **kwargs)
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default5]:[rank21]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default5]:[rank21]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank21]: return self._call_impl(*args, **kwargs)
[default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank21]: return forward_call(*args, **kwargs)
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default5]:[rank21]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default5]:[rank21]: pipeline_state.run_communication() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:[rank21]: recv_activation_tensor = recv_activation() [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:[rank21]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:[rank21]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:[rank21]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]:[rank21]: dist.recv( [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default5]:[rank21]: return func(*args, **kwargs) [default5]:[rank21]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default5]:[rank21]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:[rank21]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:[rank22]: Traceback (most recent call last): [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank22]: trainer.train(dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank22]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank22]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank22]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank22]: output = model(**micro_batch) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank22]: sharded_logits = self.model( [default6]:[rank22]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank22]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank22]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank22]: return self._call_impl(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank22]: return forward_call(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:[rank22]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:[rank22]: pipeline_state.run_communication() [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:[rank22]: recv_activation_tensor = recv_activation() [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:[rank22]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:[rank22]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:[rank22]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]:[rank22]: dist.recv( [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [default6]:[rank22]: return func(*args, **kwargs) [default6]:[rank22]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv [default6]:[rank22]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:[rank22]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
W0703 07:22:33.932000 140611062384384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_448957_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:34.227000 140188316808960 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-174-36.ec2.internal_879618_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:34.434000 140111099148032 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-7.ec2.internal_2070274_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:34.527000 140212220188416 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-202.ec2.internal_1346661_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:34.605000 139657907754752 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_364995_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:34.762000 140617775884032 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_938434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:34.810000 140601442875136 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-163-147.ec2.internal_836260_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
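These warnings mean the c10d store became unreachable, which is expected once the agent process hosting it dies: the c10d rendezvous backend is served out of a `TCPStore` on one node, every other agent is a client of it, and the keep-alive heartbeat boils down to periodic store writes. A sketch of that client relationship; host, port, and key are illustrative, not this job's actual endpoint:

```python
# Sketch: the c10d rendezvous is a TCPStore served by one agent; everyone else
# is a client. Once the hosting process exits, the connect (or any later
# get/set) fails, which the agent wraps as a RendezvousConnectionError.
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore(
    host_name="ip-26-0-162-233",   # assumed rendezvous host for this job
    port=29500,                    # assumed port
    is_master=False,               # client side: connect, don't serve
    timeout=timedelta(seconds=10),
)
store.set("heartbeat", "alive")    # keep-alives are essentially writes like this
print(store.get("heartbeat"))      # fails with a network error once the host dies
```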
W0703 07:22:35.057000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070349 closing signal SIGTERM
W0703 07:22:35.058000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070350 closing signal SIGTERM
W0703 07:22:35.058000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070351 closing signal SIGTERM
W0703 07:22:35.058000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070352 closing signal SIGTERM
W0703 07:22:35.061000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070353 closing signal SIGTERM
W0703 07:22:35.061000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070354 closing signal SIGTERM
W0703 07:22:35.062000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070355 closing signal SIGTERM
W0703 07:22:35.062000 140116759881536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2070356 closing signal SIGTERM
W0703 07:22:35.095000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938508 closing signal SIGTERM
W0703 07:22:35.096000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938509 closing signal SIGTERM
W0703 07:22:35.096000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938510 closing signal SIGTERM
W0703 07:22:35.098000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938511 closing signal SIGTERM
W0703 07:22:35.098000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938512 closing signal SIGTERM
W0703 07:22:35.099000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938513 closing signal SIGTERM
W0703 07:22:35.099000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938514 closing signal SIGTERM
W0703 07:22:35.099000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346737 closing signal SIGTERM
W0703 07:22:35.100000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346738 closing signal SIGTERM
W0703 07:22:35.100000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346739 closing signal SIGTERM
W0703 07:22:35.101000 140623436617536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 938515 closing signal SIGTERM
W0703 07:22:35.100000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346740 closing signal SIGTERM
W0703 07:22:35.102000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346741 closing signal SIGTERM
W0703 07:22:35.102000 140607103608640 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 836335 closing signal SIGTERM
W0703 07:22:35.102000 140607103608640 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 836336 closing signal SIGTERM
W0703 07:22:35.103000 140607103608640 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 836340 closing signal SIGTERM
W0703 07:22:35.103000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365068 closing signal SIGTERM
W0703 07:22:35.103000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365069 closing signal SIGTERM
W0703 07:22:35.103000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365070 closing signal SIGTERM
W0703 07:22:35.103000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365071 closing signal SIGTERM
W0703 07:22:35.102000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346742 closing signal SIGTERM
W0703 07:22:35.104000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346743 closing signal SIGTERM
W0703 07:22:35.104000 140217880921920 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1346744 closing signal SIGTERM
W0703 07:22:35.104000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879691 closing signal SIGTERM
W0703 07:22:35.104000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879692 closing signal SIGTERM
W0703 07:22:35.105000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879693 closing signal SIGTERM
W0703 07:22:35.105000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879694 closing signal SIGTERM
W0703 07:22:35.106000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365072 closing signal SIGTERM
W0703 07:22:35.106000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365073 closing signal SIGTERM
W0703 07:22:35.106000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365074 closing signal SIGTERM
W0703 07:22:35.108000 139663568488256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 365075 closing signal SIGTERM
W0703 07:22:35.108000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879695 closing signal SIGTERM
W0703 07:22:35.109000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879696 closing signal SIGTERM
W0703 07:22:35.109000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879697 closing signal SIGTERM
W0703 07:22:35.109000 140193977542464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 879698 closing signal SIGTERM
W0703 07:22:35.184000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449031 closing signal SIGTERM
W0703 07:22:35.185000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449032 closing signal SIGTERM
W0703 07:22:35.185000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449033 closing signal SIGTERM
W0703 07:22:35.187000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449034 closing signal SIGTERM
W0703 07:22:35.187000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449035 closing signal SIGTERM
W0703 07:22:35.187000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449036 closing signal SIGTERM
W0703 07:22:35.187000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449037 closing signal SIGTERM
W0703 07:22:35.189000 140616723117888 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 449038 closing signal SIGTERM
E0703 07:22:36.021000 140607103608640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 836334) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
W0703 07:22:36.033000 140607103608640 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_836260_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:36.062000 140607103608640 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_836260_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:36.082000 140607103608640 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_836260_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_07:22:35
  host      : ip-26-0-163-147.ec2.internal
  rank      : 11 (local_rank: 3)
  exitcode  : -6 (pid: 836337)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 836337
[2]:
  time      : 2024-07-03_07:22:35
  host      : ip-26-0-163-147.ec2.internal
  rank      : 12 (local_rank: 4)
  exitcode  : -6 (pid: 836338)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 836338
[3]:
  time      : 2024-07-03_07:22:35
  host      : ip-26-0-163-147.ec2.internal
  rank      : 13 (local_rank: 5)
  exitcode  : -6 (pid: 836339)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 836339
[4]:
  time      : 2024-07-03_07:22:35
  host      : ip-26-0-163-147.ec2.internal
  rank      : 15 (local_rank: 7)
  exitcode  : -6 (pid: 836341)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 836341
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_07:22:35
  host      : ip-26-0-163-147.ec2.internal
  rank      : 8 (local_rank: 0)
  exitcode  : -6 (pid: 836334)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 836334
============================================================
srun: error: ip-26-0-163-147: task 1: Exited with exit code 1
W0703 07:22:38.937000 140611062384384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_448957_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
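The failure summary above decodes with nothing but standard conventions: a negative exitcode means "killed by that signal number", and with 8 workers per node (prefixes default0..default7 in this log) the global rank is node_index * 8 + local_rank, which is how srun task 1 (ip-26-0-163-147) maps to ranks 8-15. A small check in plain Python:

```python
# Decode the elastic failure report: negative exit code = killed by signal;
# global rank = node_index * nproc_per_node + local_rank.
import signal

def describe_exitcode(code: int) -> str:
    if code < 0:
        return f"killed by signal {signal.Signals(-code).name}"
    return f"exited with status {code}"

print(describe_exitcode(-6))   # killed by signal SIGABRT -> "Signal 6 (SIGABRT)"

NPROC_PER_NODE = 8             # one worker per GPU on these nodes
node_index = 1                 # ip-26-0-163-147 is srun task 1
for local_rank in (0, 3, 4, 5, 7):
    print(f"local_rank {local_rank} -> rank {node_index * NPROC_PER_NODE + local_rank}")
# -> ranks 8, 11, 12, 13, 15: exactly the ranks listed in the failure report
```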
W0703 07:22:39.231000 140188316808960 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-174-36.ec2.internal_879618_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:39.438000 140111099148032 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-7.ec2.internal_2070274_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:39.531000 140212220188416 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-202.ec2.internal_1346661_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:39.666000 139657907754752 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_364995_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:39.767000 140617775884032 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_938434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:42.329000 140193977542464 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_879618_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:42.342000 140193977542464 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_879618_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting
    self._state_holder.sync()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync
    get_response = self._backend.get_state()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-174-36: task 7: Exited with exit code 1
W0703 07:22:43.942000 140611062384384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_448957_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
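The traceback shows that torchrun is a thin wrapper over `elastic_launch`/`launch_agent`, and that each surviving agent died while polling the rendezvous (`num_nodes_waiting`) after the store vanished. For reference, the same launch expressed through that API; field names are from `torch.distributed.launcher.api`, but defaults differ across torch releases, and the rendezvous endpoint shown is an assumption, not this job's real one:

```python
# Sketch of the launcher path the traceback goes through: torchrun wraps
# LaunchConfig + elastic_launch. Endpoint and run_id below are illustrative.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def worker() -> None:
    # Training entrypoint; torchrun-style env vars (RANK, WORLD_SIZE, ...)
    # are populated by the agent before this runs.
    pass

config = LaunchConfig(
    min_nodes=8,
    max_nodes=8,
    nproc_per_node=8,
    run_id="bench_cluster",
    rdzv_backend="c10d",
    rdzv_endpoint="ip-26-0-162-233:29500",  # assumed endpoint
    max_restarts=0,  # fail fast instead of re-rendezvousing on worker death
)

if __name__ == "__main__":
    # Raises ChildFailedError on worker failure, exactly as in the log above.
    elastic_launch(config, worker)()
```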
W0703 07:22:44.443000 140111099148032 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-7.ec2.internal_2070274_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:44.535000 140212220188416 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-202.ec2.internal_1346661_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:44.631000 140616723117888 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_448957_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:44.642000 140616723117888 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_448957_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[identical "DistNetworkError: Broken pipe" -> RendezvousConnectionError traceback as above, from the agent on ip-26-0-164-207; duplicate omitted]
W0703 07:22:44.671000 139657907754752 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_364995_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:44.771000 140617775884032 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_938434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
srun: error: ip-26-0-164-207: task 2: Exited with exit code 1
W0703 07:22:46.285000 140116759881536 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-7.ec2.internal_2070274_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:46.296000 140116759881536 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-7.ec2.internal_2070274_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[identical traceback from the agent on ip-26-0-173-7; duplicate omitted]
W0703 07:22:46.534000 139663568488256 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_364995_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:46.545000 139663568488256 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_364995_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[identical traceback from the agent on ip-26-0-173-246; duplicate omitted]
W0703 07:22:46.751000 140623436617536 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_938434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:46.766000 140623436617536 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_938434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[identical traceback from the agent on ip-26-0-165-24; duplicate omitted]
W0703 07:22:46.813000 140217880921920 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-202.ec2.internal_1346661_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0703 07:22:46.824000 140217880921920 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-202.ec2.internal_1346661_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[identical traceback from the agent on ip-26-0-173-202; duplicate omitted]
srun: error: ip-26-0-173-7: task 4: Exited with exit code 1
srun: error: ip-26-0-173-246: task 6: Exited with exit code 1
srun: error: ip-26-0-165-24: task 3: Exited with exit code 1
srun: error: ip-26-0-173-202: task 5: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.